Questionnaires in User Experience Research (Query)

A list of not-so frequently asked questions (Ed: 5.1)

Compiled by: Jurek Kirakowski,
User Experience Solutions, Cork, Ireland.
This edition: 8th June, 2021.


Over the years, I have seen many questions asked about the use of questionnaires in usability engineering (oops, nowadays, User Experience Research.) The list on this page is a compilation of the questions I have heard most often or should have heard; and the answers I gave, should have given, or would have given if I had thought of it first.

The purpose of this list is really to give the interested reader a flavour of what is involved in creating a questionnaire, especially in our field of user experience research; and also, to give some practical advice on how to deploy and deal with the data arising from questionnaires.

The previous editions were hosted on the computers at University College Cork, last dated the year 2001, twenty years ago. I've blown the dust off and added a bit more content.

A number of folk have given me feedback on this document, and they are gratefully acknowledged below.

Comments or questions, please email me at jzk@uxp.ie.


Index of questions on this page (use the "back" button on your browser to get back to this index):

Acknowledgements.


What is a questionnaire?

Questionnaires are a methodology for the elicitation, recording, and collecting of information from people. The four key words in this definition (methodology, elicitation, recording, collecting) summarise the essence of what questionnaires are about. I can give a 50-minute lecture explaining this definition with examples and anecdotes, but the four points below summarise the gist of it.

Anne Anastasi also once famously wrote: a questionnaire is a sample of behaviour. This reflects the times she wrote in (1961, the behaviourist era in psychology.) But it also encourages us to remember that filling out a questionnaire is something that a person does in one, three, five or more minutes of their life. We want to use the data they have generously provided to generalise from this burst of activity to longer periods of their lives and lives of other like-minded persons.

  • Methodology: This suggests that questionnaire technology is a collection of tools to be used rather than an end in itself or a work of modern art. Before you start even thinking of using a questionnaire, a useful question to ask yourself is: what do I need to know and how best can I find this out? Some kinds of information are not very reliably gathered using questionnaires (e.g. how often people do things, or self-reports about aspects of life where status is involved.) And it is also very useful at the start to ask yourself: how will the information I am seeking give me a true picture of what I want to know?

  • Elicitation: A questionnaire may bring out information from the respondent or it may start the respondent thinking or even doing some work on their own in order to supply the requested information. In any case, a questionnaire is a device that starts off a process of discovery in the respondent's mind. People are not frozen into one state of mind as they go through a questionnaire. They think, they change.

  • Recording: The answers the respondent makes are somehow recorded onto a permanent medium which can be re-played and brought into analysis. Usually by writing, but also possibly by recording voice or video. The critical issue is that nothing, however trivial it might appear, is missed at the recording stage. You can always ignore stuff when you analyse. But the researcher, like the forensic scientist, has to leave a trail of evidence which can show to a reviewer (or to the researcher themselves in a reflective mood) how the conclusions were reached.

  • Collecting: People who use questionnaires are collectors. We are also sometimes called hoarders but that's unfortunate. Given the amount of effort involved in creating a questionnaire, if you only ever needed to use it for one respondent, chances are you'd find some more efficient method of getting the information. So in practical terms, you will have piles of data. However, unless you intend to leave these piles of data mouldering in your filing cabinet till the day you retire, you must also consider what you are going to do with them. So make sure your method of collecting will enable you to retrieve your data.
    Since questionnaires delivered by internet have become the norm these days there is a critical issue with regard to collecting data: how safe is it from snooping and corruption? Backups, security, secure deletion.

    Which brings us neatly back to the first point that a questionnaire involves the systematic use of methods.

Questionnaires are made up of items to which the respondent supplies answers or reactions.

Remember that answering a questionnaire is an activity. It focuses the respondent's mind on a particular topic and, almost by definition, on a certain way of approaching the topic through the words we use and the way we present the questionnaire. We try hard to avoid bias when we construct questionnaires; when a respondent has to react to very tightly focused questions (so-called closed-ended questionnaires) bias is a real problem. When a respondent has to react to a looser set of questions (so-called open-ended), bias is still there, but it's most probably more deeply hidden.

Are there different kinds of questions?

There are two basic types of questions:

Factual-type questions

Such questions ask about public, observable information that it would be tedious or inconvenient to get any other way. For instance, number of years that a respondent has been working with computers, or what kind of education did the respondent get. Or, how many times did the computer break down in a two-hour session, or how quickly did a user complete a certain task. If you are going to include such questions you should spend time and effort to ensure that the information you are collecting is accurate, or at least to determine the amount of bias in the answers you are getting. Don't throw the onus on the respondent (unless of course you are a very large organisation who can punish respondents for giving you inaccurate information!)

Don't be lazy. If this information is important to you, and you can get it another way, do so (but be careful to tell the respondent that you will be doing this and give the respondent the right to refuse you such data: not something that some large corporations seem to worry too much about, these days of the 2020s.)

Subjective questions

These ask the respondent to tell you something from inside themselves; some information that nobody else has access to. In contrast to factual type questions, there can be no objectively right or wrong answers to subjective questions. But error creeps in when a respondent picks up the question in a way that you (or others) have not anticipated. More egregious sources of error happen when respondents prevaricate, lie, or delude themselves; wittingly or unwittingly. We do have ways of dealing with error in responses to subjective questions, but none of them are infallible.

It may be helpful to consider subjective questions on a scale from qualitas (internal states of the respondent) to sensitivitas (reactions to externals by the respondents). In user experience research, understanding the distinction turns out to be fairly important.

  • Qualitas: this is the word given by modern philosophers to the state of being of a person, in answer to the question: "What is it like to be me (now, at this moment)?" All the stuff inside us that defines us as a person distinct from the person next to us: private events: thoughts, emotions, plans, moods, and so on. In user experience research we are most concerned about what private events are evoked inside the respondent when they use (or think about using) the technology in question.
    So we now attempt to dredge out of them their internal (sometimes fancifully known as their psychological) reaction caused by the technology. In broad terms, qualitas encompasses general attitudes which are private events triggered by broad categories of technology (so for instance, attitudes to the internet.) But in the narrow sense we use in user experience research we are usually much more interested in the private events which are triggered by the experience of this technology used by these people in this kind of environment. Those readers who are familiar with the ISO 9241 Part 11 definition of the term usability will understand this as the element of usability generally referred to in that standard as user satisfaction (see the Preview of this standard on the linked page, or buy a copy of the standard for yourself.)

    Questions about qualitas are generalisable to quite a broad range of technology products because they relate to the internal private world of the respondent.


  • Sensitivitas: this is the strength of response (or sensitivity) directly to aspects of the world outside: what do respondents think about something or someone. There's no right or wrong answer, all we have to do is give the strength of our feeling: do we like it or not, or which do we prefer? Will we vote for Mr A or Mr B? (this is sometimes called an opinion survey.)

    Sometimes, we expect our respondents to channel their feelings in a certain direction:

    • Do we find [this] feature of the software helpful?
    • Do [these] colours of the interface annoy us?
    • Can we easily use [this] interface to make a purchase?

    Such surveys do not concern themselves with subtleties of thought in the respondent, they are concerned with finding out how something affects the respondent. Sensitivia type questions direct the thought of the respondent outwards, towards people or artifacts in the world out there. Responses to such questions can sometimes be checked against actual behaviour of people, usually, in retrospect ("Wow! It turned out that those soft, flexible keyboards sold a lot less than we were led to believe!")

    Opinion polling is a complex set of methodologies and the biggest problem is always to get a sample that will represent the way that the majority of people of interest will react in terms of their eventual behaviour.

We can't directly cross-check qualia against behaviours in the way we can with factual and sensitivia type questions. However, we can check whether qualia results are internally consistent and this is an important consideration when developing attitude questionnaires.

What are the advantages of using questionnaires in usability research?

  • The biggest single advantage is that a questionnaire gives you feedback from the point of view of the user. If the questionnaire is reliable, and you have used it according to the instructions, then this feedback is a trustworthy sample of what you (will) get from your whole user population.
  • Another big advantage is that measures gained from a qualia type of questionnaire are to a large extent, independent of the system, users, or tasks to which the questionnaire was applied. They ask the respondent to refer to their inner mental states. You could therefore compare

    • the perceived usability of a word processor with an electronic mailing system,
    • the ease of use of a database as seen by a novice and an expert user,
    • the ease with which you can do graphs using a spreadsheet compared to using a specialised graphics editor.

  • Additional more general advantages are that questionnaires are usually quick and therefore cost effective to administer and to score and that you can gather a lot of data using questionnaires as surveys. And of course, questionnaire data can be used as a reliable basis for comparison or for demonstrating that quantitative targets in usability have been met.

What are the disadvantages?

  • The biggest single disadvantage is that a questionnaire tells you only the user's reaction as the user perceives the situation. Thus some kinds of questions, for instance, to do with time measurement or frequency of event occurrence, are not usually reliably answered in questionnaires. On the whole it is useful to distinguish between subjective measures (which is what questionnaires are good for) and performance measures (which are publicly-observable facts and are more reliably gathered using direct event and time recording techniques.)

  • There is an additional smaller disadvantage. A questionnaire is usually designed to fit a number of different situations (because of the costs involved in creating one, whether it be a qualia or sensitivia focused instrument.) Thus a questionnaire cannot tell you in detail what is going right or wrong with the application you are testing. But a well-designed questionnaire can get you near to the issues, and stimulate a thought process in the respondent's mind. A few open-ended questions can then be tagged on to elicit specific information - and I have found they frequently do.

  • Those who have worked with questionnaires for a long time in industry will also be aware of the seductive power of the printed number. Getting hard, quantitative data about user attitudes or opinions is good, but this is not the whole story. If the aim of the investigation is to analyse the overall usability of a piece of software, then the subjective data must be enhanced with performance, mental effort, and effectiveness data. In addition, one should also ask, why? This means talking to the users and observing them. Such an approach to research is called "triangulation": you get your bearings about the object of interest from different points of view.

What are the ethical issues in doing research by questionnaire?

The main issue in ethics regarding questionnaire research in user experience testing is that of informed consent. There is an extremely good document from the US Department of Health and Human Services about informed consent in clinical research on human subjects. At present, however, the European Union is considered a global front runner in setting rules for the digital sphere. The General Data Protection Regulation (GDPR) established a regime based on data protection as a fundamental human right, and set a global standard for modern privacy protection. This is the future. Although what we do with questionnaires is not clinical, it does involve human subjects and may request highly personal data from them which may harm them if it falls into the wrong hands or if collated with other public information about them.

I chaired the ethics oversight committee in the School of Applied Psychology at University College Cork for some years. We had the following questions as routine prompts to an application for ethical clearance for research involving questionnaires:

  • What do you tell your respondents before they agree to participate and what questionnaires will you use?
  • If a respondent decides not to continue or to revoke their data, do they have a safe way out?
  • Do your respondents know what you will do with the data they provide?
  • Do your respondents know who else will see this data?
  • Will there be consequences to your respondents from participating in your research?
  • If participation raises unforeseen consequences for your respondents, what resources do you have to manage this?

Don't forget that if you are using the internet, then the information flow between a respondent and your database will identify, at the very least, the IP address from which and the time at which any disclosure was made to you, the investigator. Many ordinary people are not aware of this. Unless you exercise encryption to a high standard in your database, and use the https protocol at the very minimum, your data is liable to be snooped on by unknown third parties. If you ask for details that may identify the respondent by collating with other public information about them, you run the risk of exposing the identity of your participants. Note that third-party questionnaire hosts are often evasive about the security they exercise over your data: read the small print carefully.

In general, I try to ensure that my questionnaire collection method (which these days is by default on-line) loses IP addresses as quickly as possible, does not ask for information which may be triangulated with other resources to point to an individual, and is securely encoded for its lifetime. If any data is kept for building standardisation databases, then I ensure that every reference to client and respondent is eliminated from the record.

If freetext information provided (unwittingly) by respondents does sound as if it might identify a person, I will redact it. My students, who sometimes collected a lot of freetext information, made it a point not to include verbatim quotations from respondents in their theses and publications. Such a policy interferes with the concept of the trail of evidence that is considered critical for scientific investigations. A number of reviewers and external examiners had to be convinced that this was the price we have to pay for upholding the ethics of informed consent.

If you need to link data to people, give each respondent to your questionnaire a unique, random password. You can keep that online. Give their online personal data a totally different unique random password. Keep an offline list (securely under lock and key!) connecting the two passwords. It is far less likely that a thief will break into your office and steal the paper list than that a theft will happen online. But it is possible. So remember that no system of security is totally foolproof.
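
The two-password scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name make_link_record and the token length are invented here, and in real use the linking table would be written to paper or offline media rather than kept in a running program.

```python
import secrets

def make_link_record():
    """Create two unrelated random tokens for one respondent (hypothetical helper)."""
    response_token = secrets.token_urlsafe(16)  # stored online with the questionnaire answers
    personal_token = secrets.token_urlsafe(16)  # stored online with the personal data
    return response_token, personal_token

# The linking table is the only place where the two tokens appear together.
# It is the part that should live offline, under lock and key.
linking_table = [make_link_record() for _ in range(3)]

for response_token, personal_token in linking_table:
    print(response_token, "<->", personal_token)
```

Because the tokens come from a cryptographic random source and are generated independently, nothing in the online data connects a set of answers to a person unless the offline list is also obtained.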

How do questionnaires fit in with other HCI evaluation methods?

The ISO 9241 standard, part 11 (2018), defines usability in terms of effectiveness, efficiency, and satisfaction. If you are going to do a usability laboratory type of study, then you will most probably be recording user behaviour on a video or at least timing and counting events such as errors, hesitations, or questions. This is known as efficiency (performance) analysis.

You will also most probably be assessing the quality of the outputs that the end user generates with the aid of the system you are evaluating. Although this is harder to do, and more subjective, this is known as effectiveness analysis.

But these two together don't add up to a complete picture of usability. You want to know what the user feels about the way they interacted with the software. Satisfaction is defined as a subset of user experience that addresses the experience resulting from actual (or as defined in the 2018 version of Part 11, possibly also the intended) use. In many situations, this may be the single most important item arising from an evaluation! Enter the user satisfaction questionnaire.

It is important to remember that these three items (effectiveness, efficiency, and satisfaction) don't always give the same answers: a system may be effective and efficient to use, but users may hate it. Or the other way round.

Questionnaires of a factual variety are also used very frequently in evaluation work to keep track of data about users such as their age, experience, and what their expectations are about the system that will be evaluated. Expectations are considered to affect behaviour, behaviour to affect satisfaction, and satisfaction to affect expectations, in a triangular relationship.

What is meant by reliability?

The reliability of a questionnaire is the ability of the questionnaire to give the same results when filled out by like-minded people in similar circumstances. Reliability is usually expressed on a numerical scale from zero (very unreliable) to one (extremely reliable.)

What is meant by validity?

The validity of a questionnaire is the degree to which the questionnaire is actually measuring or collecting data about what you think it should be measuring or collecting. Note that not only do opinion surveys have validity issues; factual questionnaires may have very serious validity issues if, for instance, respondents interpret the questions in different ways. In contrast to reliability, which has a fairly well defined set of statistical methods and recognised standards to guide the analyst, there are many different interpretations of the concept of validity. Beware of the circular definition: "intelligence is what IQ tests measure, IQ tests measure intelligence."

Should I develop my own questionnaire?

If you have a lot of time, patience, and resources, then go right ahead. You are well advised to do a course in psychological measurement, including a heavy dose of statistics, and to become sensitive to language use in your community beforehand. You should also try to gain experience with administering and interpreting questionnaires that have already been devised, for purposes outside of usability evaluation (as well as those devised specifically for usability evaluation.) You should ensure that your questionnaire has adequate reliability and validity and that you have an idea of what the expected values are, so you can assign a meaning to the score which a particular technology product has attained.

If this list of qualifications sounds ominous to you, then take the sensible option: use a questionnaire that has already been developed and standardised by someone else, and look for the published references to the questionnaire. It takes about six months of hard work to create a questionnaire that can provide interpretable results. I can attest this from many years of my work and from observing and supervising the work of many bright, talented and highly motivated students.

What's wrong with putting a quick-and-dirty questionnaire together?

The problem with a quick-and-dirty questionnaire is that you usually have no notion of how reliable or valid the questionnaire is. You may be lucky and have developed a very good questionnaire or you may be unlucky. However, until you put your work through the intensive statistical and methodological procedure involved in creating a questionnaire, you just won't know.

A poor questionnaire will be insensitive to differences between versions of software, releases, etc. and will not show statistically significant differences. It will not show improvement in your processes. You are then left in a quandary: does the questionnaire fail to show differences because they do not actually exist, or is it simply because your questionnaire is insensitive and unreliable? If your questionnaire does show differences, is this because it is measuring something other than what you think it should be, or is it because you are actually getting better?

Quick-and-dirty questionnaires do not usually have a database of expected values behind them. But unless you have such a database of expected values, you will not know how to interpret the outputs from your questionnaire. Is a score of 75/100 good? Do many software products achieve this level of satisfaction?
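
To make the point concrete, here is a minimal Python sketch of how a database of expected values gives meaning to a raw score. The reference scores are invented for illustration, and percentile_rank is a hypothetical helper, not part of any published questionnaire's scoring procedure.

```python
def percentile_rank(score, reference_scores):
    """Percentage of reference scores that are at or below the given score."""
    at_or_below = sum(1 for s in reference_scores if s <= score)
    return 100.0 * at_or_below / len(reference_scores)

# Invented scores (out of 100) that other products obtained on the same questionnaire.
reference_scores = [55, 60, 62, 65, 68, 70, 72, 78, 80, 85]

rank = percentile_rank(75, reference_scores)
print(f"A score of 75 is at or above {rank:.0f}% of the reference products")
```

Without the reference list the figure 75 has no meaning; with it, the same figure can be placed relative to what other products achieve.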

The crux of the matter is: you can't tell unless the questionnaire has been through the standard development and test process.

Factual-type questionnaires are easy to do, though, aren't they?

A factual, or "survey" questionnaire is one that asks for relatively straightforward information and does not need personal interpretation to answer. Answers to factual questions can be proven right or wrong. A subjective questionnaire is one that asks the respondent what they think of something. An answer to a subjective question cannot be proven right or wrong: it is simply the opinion of the respondent and is inaccessible to independent verification, although you may wish to look at the subsequent behaviour of the respondent to see how straight their answer was.

Although it is important to check that the respondents understand the questions of both kinds of questionnaires clearly, the burden of checking is much greater with subjective questionnaires because we cannot sanity check the answers against reality.

What's the difference between a questionnaire which gives you numbers and one that gives you free text comments?

A closed-ended questionnaire is one that leaves no room for individual comments from the respondent. The respondent replies to a set of questions in terms of pre-set responses for each question. These responses can then be coded as numbers. An open-ended questionnaire requests the respondent to reply to the questions in their own words, maybe even to suggest topics to which replies may be given. The ultimate open-ended questionnaire is a diary study in which respondents write about their experiences (ok, perhaps with a few hints here and there as to what to write about!) Slightly more structured is a "critical incident" type of questionnaire in which respondents explain several good or bad experiences, and the circumstances which led up to them, and what happened after, all in their own words.

  • Closed-ended questionnaires are good if you are going to be processing massive quantities of data, and if your questionnaire is appropriately scaled to yield meaningful numeric data. If you are using a closed-ended questionnaire, however, encourage the respondents to leave their comments either in a special space provided on the page, or in the margins. You'll be surprised what this gives you.

  • Open-ended questionnaires are good if you are in an exploratory phase of your research or you are looking for some very specific comments or answers that can't be summarised in a numeric code.

Can you mix factual and opinion questions, closed and open ended questions?

It doesn't do to be too purist about this. It's a good idea to mix some open-ended questions in a closed-ended opinion questionnaire and it's also not a bad thing to have some factual questions at the start of an opinion questionnaire to find out who the respondents are, what they do, and so on. Some of your factual questions may need to be open-ended, for instance if you are asking respondents for the name of the hardware they are using.

This also means you can construct your own questionnaire booklets or web pages by putting together a reliable opinion questionnaire, for instance, and then add some factual questions at the front and maybe some open ended opinion questions at the end.

How do you analyse open-ended questionnaires?

The standard method is called "content analysis" and is a subject all of its own. In content analysis you boil down responses into categories, and categories into types. Then you can count the frequency of occurrence of different types of response.
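
The counting step at the end of a content analysis can be sketched as follows. The categories, the types, and the coded responses are all invented for illustration; the hard part, a human analyst reading each response and assigning it to a category, is assumed to have been done already.

```python
from collections import Counter

# Hypothetical coding scheme: each category of response belongs to one broader type.
category_to_type = {
    "slow page loads": "performance",
    "crashes": "performance",
    "confusing menus": "navigation",
    "cannot find search": "navigation",
    "likes colours": "aesthetics",
}

# One category label per free-text response, as assigned by the analyst.
coded_responses = [
    "slow page loads", "confusing menus", "crashes",
    "confusing menus", "likes colours", "cannot find search",
]

category_counts = Counter(coded_responses)
type_counts = Counter(category_to_type[c] for c in coded_responses)

print(category_counts.most_common())
print(type_counts.most_common())
```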

What is a Likert-style questionnaire? One with five response choices to each statement, right?

No indeed not. Rensis Likert is a man who wrote an influential article in 1932 about a simple way of measuring attitude. This type of scale has been called by his name ever since, although Likert himself published virtually nothing else about it in a long and productive life. But a lot of research and practice by many others ensued, greatly evolving the concept to what is now known as classical test theory (CTT). A Likert-style questionnaire is one in which you have been able to prove that each item of the questionnaire has a similar psychological "weight" in the minds of the respondents, and that each item is making a statement about the same idea. Likert scaling is quite tricky to get right, but when you do have it right, you are able to sum the scores on the individual items to yield a questionnaire score that you can interpret as differentiating between shades of opinion from "completely against" to "completely for" the construct you are measuring.

It is possible to find questionnaires which seem to display Likert-style properties in which many of the items are simply re-wordings of other items. Such questionnaires may show some fantastic reliability data, but basically they're a cheat because you're just adding in extra items that bulk up the statistics without telling you anything really new.

And of course there are plenty of questionnaires around which are masquerading as Likert-style questionnaires but which have never had their items tested for any of the required Likert properties. Summing item scores of such questionnaires is just nonsense. Treat such questionnaires as checklists (see below) until you are able to do some psychometric validation on them.

How can I tell if a question belongs to a Likert scale or not?

The essence of a Likert scale is that the scale items, like a shoal of fish, are all of approximately the same size, and are going in the same direction.

People who design Likert scales are concerned about developing a batch of items that all have approximately the same level of importance (weight) to the respondent, and are all more or less talking about the same concept (direction), which concept the scale is trying to measure. Designers use various statistical criteria to quantify these two ideas.

To start with, we canvass opinions which we can then cast as statements about the concept we are trying to measure. It's important to get a broad range of opinions to avoid creating holes in the validity of your questionnaire. It's easy to throw out questions later, but not so easy to put in new questions once you have started the process.

There are various shades of opinion as to how to word the statements or questions. Issues that arise often are:

  • Should we avoid words like "generally" or "usually"? Most commentators advise that statements should be definite and avoid vagueness in their wording.
  • Should we avoid negative statements? Opinion is divided on this topic as well - see below.
  • Do avoid colloquialisms and turns of phrase which are not literally interpretable. A good challenge is to ask someone to translate your statements into another language and to see where they have problems conveying the meaning.
  • Do avoid statements which will put the respondent in a bad light or that will result in their slandering or offending other groups of people.
  • Do avoid statements which are objectively and without a doubt either true or false.

So now we have to get a bunch of people to fill out the first draft of the questionnaire we are trying to design. We should ideally have about 100 respondents with varied views on the topic we are trying to measure, and certainly, more respondents than questions. We then compute various statistical summaries of this data. I hate to have to say this here, but you should also ensure that a variety of software, web sites, or whatever you are developing the scale for has been sampled. Different objects will result in different patterns of response.

The first summary I encourage is to get a count of the number of times each response option has been selected in the data set for each question. There should be an even number of votes on each side of the central point (so, on a two-choice response option, roughly as many "agrees" as "disagrees"; on a three- or five- response option, as many on one side of the middle point as on the other.) This is to eliminate questions to which the answers are invariably strongly positive or negative. With respect to mothers, these are sometimes called "motherhood" statements, because, let's face it - motherhood is considered to be a good thing. And as for statements with which everybody disagrees? You can make up your own name for those. Beware also, the statements to which everybody selects the middle option. Although the numeric average for questions like these will be comfortably in the middle (hooray!), basically this pattern is just telling you that nobody can decide whether they agree or disagree (boo!) Just computing averages on each question will hide this regrettable phenomenon.
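
A minimal Python sketch of this first summary follows. The data are invented, responses are assumed to be coded 1 to 5, and the two thresholds (0.8 and 0.6) are arbitrary assumptions chosen for the example, not recognised cut-offs.

```python
from collections import Counter

def response_profile(responses, n_options=5):
    """Count how often each response option (coded 1..n_options) was chosen."""
    counts = Counter(responses)
    return [counts.get(option, 0) for option in range(1, n_options + 1)]

def split_profile(profile):
    """Return (votes below the centre, votes on the centre, votes above the centre)."""
    half = len(profile) // 2          # assumes at least two response options
    below = sum(profile[:half])
    above = sum(profile[-half:])
    centre = sum(profile) - below - above
    return below, centre, above

def flag_item(profile, lopsided=0.8, fence_sitting=0.6):
    """Label a profile using two assumed, arbitrary thresholds."""
    below, centre, above = split_profile(profile)
    total = below + centre + above
    if total == 0:
        return "no data"
    if centre / total >= fence_sitting:
        return "mostly middle option"
    sides = below + above
    if sides and max(below, above) / sides >= lopsided:
        return "lopsided"
    return "ok"

# Invented responses from ten respondents to three items.
items = {
    "Q1": [1, 2, 2, 3, 4, 4, 5, 3, 2, 4],
    "Q2": [5, 5, 4, 5, 4, 5, 5, 4, 5, 3],
    "Q3": [3, 3, 3, 2, 3, 3, 4, 3, 3, 3],
}

for name, responses in items.items():
    profile = response_profile(responses)
    average = sum(responses) / len(responses)
    print(name, profile, f"mean={average:.1f}", flag_item(profile))
```

In this made-up data Q1 and Q3 have the same average, yet only the counts reveal that nearly everybody chose the middle option on Q3.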

If you have questions which you think make a lot of sense but which have some of these strange response profiles, STOP RIGHT THERE. You need to think again why this is happening: consult your friends, some of your respondents, try out different wordings on different people... and issue the next iteration without going any further. If the problem persists over iterations, then ditch the question(s), to be sure. But questions with uneven response profiles will produce strange results when you come to further analysis: this is because after this stage, all our statistical analysis procedures assume that the data comes from a normally distributed population of data. Or at least, one that is fairly evenly distributed between the extremes. Yes, gentle reader. This assumption includes the computation of the average, or the humble arithmetic mean of a sample. See what happens when you include a millionaire's child in a sample of average student disposable incomes.
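
The millionaire's child can be shown in a few lines of Python. The income figures are invented; the median is included only as a contrast, to show how far the mean moves when a single extreme value joins the sample.

```python
from statistics import mean, median

# Invented weekly disposable incomes for five students.
incomes = [80, 90, 100, 110, 120]

# The same sample with one millionaire's child added.
with_outlier = incomes + [10000]

print("without outlier: mean =", mean(incomes), "median =", median(incomes))
print("with outlier:    mean =", mean(with_outlier), "median =", median(with_outlier))
```

One extra respondent moves the mean from 100 to 1750, a figure that describes nobody in the sample.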

So, presuming that most of our questions are fairly well behaved, we ask: do the items all have the same level of importance to the respondent? To measure this we look at the reliability coefficient of the questionnaire. If the reliability coefficient is low (near to zero) this means that some of the items may be more important to the respondents than others. If the reliability coefficient is high (near to one) then the items are most probably all of the same psychological "weight."

Are the items all more or less talking about the same concept? To measure this we look at the statistical correlation between each item and the sum of the rest of the items. This is sometimes called the item-whole correlation. Items which don't correlate well with the rest of the items are clearly not part of the scale (going in a different "direction") and should be thrown out or amended.

It's fascinating to use an interactive statistical package and to watch how reliabilities and item-whole correlations change as you take items in and out of the questionnaire.
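If you have no statistical package to hand, the two statistics can be sketched in plain Python. Two assumptions are made here that the text does not state: the reliability coefficient is taken to be Cronbach's alpha, and the item-whole correlation is taken to be a Pearson correlation between each item and the sum of the remaining items. The data set is invented, and the fourth item is deliberately out of step with the other three.

```python
from statistics import mean, pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])
    item_variances = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def item_rest_correlations(rows):
    """Correlate each item with the sum of all the other items."""
    k = len(rows[0])
    correlations = []
    for i in range(k):
        item = [row[i] for row in rows]
        rest = [sum(row) - row[i] for row in rows]
        correlations.append(pearson(item, rest))
    return correlations


# Invented data: five respondents, four items (already coded 1 to 5).
rows = [
    [5, 4, 5, 1],
    [4, 4, 4, 5],
    [2, 2, 1, 3],
    [1, 2, 2, 4],
    [3, 3, 3, 2],
]
without_item_4 = [row[:3] for row in rows]

print("item-rest correlations:", item_rest_correlations(rows))
print("alpha with all items:  ", cronbach_alpha(rows))
print("alpha without item 4:  ", cronbach_alpha(without_item_4))
```

Taking the fourth item out and re-computing is the by-hand version of watching the reliabilities change as items go in and out of the questionnaire.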

Factor Analysis

A more complex statistical procedure is called factor analysis. There are a number of variants of this procedure, and most statistical packages nowadays offer them. The bones of this procedure are that it is hypothesised that there are one or more invisible factors which determine the responses of users to the actual questions. Each question is said to "load" onto one or more of these invisible factors. It is up to the analyst to decide whether there is really only one invisible factor underlying the data, or whether there are two, or more. This is never an open-and-shut case which can be decided on purely statistical grounds. You have to look at the wordings of the questions as well as their loadings on the invisible factors that come up. What often happens is that there is one factor on which a lot of the questions load quite heavily, and then a slew of factors onto which the questions load with less and less weight. The analyst decides how many questions to retain on the basis of a good weight (whatever that weight may be - again, nothing set in concrete here.)

The last judgement

It is important to remember that statistical criteria are not the final judgement. When a set of questions has emerged from the statistical analysis, the analyst must review both them and any questions that have been rejected and decide on the basis of their professional judgement:

  • what questions can be passed on to the second iteration,
  • what questions may need a little tweaking, and
  • what questions are really not adding much to the information the questionnaire provides.

Do not abdicate responsibility to your statistical methods! And now you see why iteration is a critical part of the process. The questionnaire that results from one iteration is a hypothesis. Hypotheses have to be tested.

A very real risk a developer runs when constructing a scale is that they start to "model the data." That is, they take items in and out and they compute their statistics, but their conclusions are only applicable to the present sample of respondents. What the developer must do next is to try the new questionnaire (with items re-worded and rejected items thrown out) on a fresh sample, and re-compute all the above statistics again. If the statistics hold on the fresh sample, then well and good. If not, then more analysis and another run will be needed.

Warning: one sometimes sees some very good-looking statistics reported on the basis of analysis of the original sample, without any check on a fresh sample. Take these with a large pinch of salt. The statistics will most probably be a lot less impressive when re-sampled.

In general, in answer to the question: is this a real Likert scale or not, the onus is on the person who created the scale to tell you to what extent the above criteria have been met. If you are not getting this level of re-assurance from the scale designer, then it really is a fishy business. And beware: a scale item which may work very nicely in one questionnaire may be totally out of place in another.

How many response options should there be in a numeric questionnaire?

There are two sets of issues here. One is: should we have an odd or even number of response options? The general answer to give here is that, if there is a possibility of having a "neutral" response to a set of questions, then you should have an odd number of response options with the central point being the neutral place. On the other hand, if it is an issue of whether something is good/bad, male/female (what we can call bi-polar) then basically, you are looking at an even number of response options.

If you wish to assess the strength of the response you are actually asking two questions in one: firstly, is it good or bad, and secondly, is it really very good or very bad. This leads you to consider more than two (or three) response options.

Some people use even numbers of response options to "force" the respondents to go one way or another on an issue which is in fact not at all bi-polar. What happens in practice here is that respondents end up giving random responses between the two middle items. Not very useful.

Odd numbers of response options somehow feel more natural: the central value feels like a dividing line between positive and negative, and subjective issues are usually not considered bi-polar in our civilisation. There is always the possibility of a state of doubt about one's feelings. This state of doubt may also be exacerbated by the respondent not understanding the question, or judging that the question does not actually correspond to any experience they may have had, or that the question is not relevant. We do hope that by the time the questionnaire has emerged from the development process that these issues have been dealt with, and that questions are all concise, clear, and to the point (but sometimes, maybe not, so, back to the drawing board?) So usually we hope that the central response option feels like the dividing line, on which some respondents may feel they are allowed to settle - but that most won't want to.

The other set of issues is how wide the set of response options should be. A scale of 1 to 3, 1 to 5, or even 1 to 100? The usual answer is five. But this is missing the point. It depends on how accurately the majority of respondents can distinguish between flavours of meaning in the questions. If you suspect that the majority of respondents are going to be fairly uninformed about the topic or vague in their judgements, then stick with a small number of response options. If you are going to be dealing with people who can give nuanced responses, then you can use a much larger set of response options.

A sure way of telling if you are using too many response options is to listen to the respondents talking after they have done the questionnaire. When people have to differentiate between fine shades of meaning that may be beyond their ability, they will complain that the questionnaire was "long" and "hard."

How do you get numbers out of a Likert questionnaire?

A Likert scale will consist of a number of statements (let's say, for the sake of an example, 10 statements, why not.) Each statement will have a response surface - a set of boxes labelled from "Agree" or "Strongly Agree" to "Disagree" or "Strongly Disagree". Let's assume for the sake of our example that we have a 5-choice response surface.

We code the "Strongly Agree" response as a 5, down to the "Strongly Disagree" response as a 1, for each statement. If some of the statements are negative then we do those statements the other way round. That is, an extreme positive response ("Strongly Agree" to a positive statement or "Strongly Disagree" to a negative) is always a 5 and an extreme negative is always a 1. That makes sense, doesn't it? The stronger, the larger. We now add up the numbers for each respondent. In our example scale, we will have numbers between 10 (that is, all 1s, "Strongly Disagrees") and 50 (all "Strongly Agrees".)

A bit of jargon here: 10 is the base of our questionnaire, and 50 - 10 = 40 is the range.
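The scoring rule can be sketched as follows. The function names and the idea of passing the positions of the negative statements as a set are assumptions made for this illustration only.

```python
def score_respondent(answers, negative_items=(), low=1, high=5):
    """Sum one respondent's coded answers, flipping negatively worded items.

    answers: codes where `high` means "Strongly Agree" and `low` means
    "Strongly Disagree". negative_items: zero-based positions of the
    negatively worded statements.
    """
    total = 0
    for position, answer in enumerate(answers):
        if position in negative_items:
            answer = (low + high) - answer  # 5 becomes 1, 4 becomes 2, and so on
        total += answer
    return total


def base_and_range(n_items, low=1, high=5):
    """Return the base (lowest possible sum) and the range of the scale."""
    base = n_items * low
    return base, n_items * high - base


# Ten statements; the statements at positions 1 and 3 are negatively worded.
answers = [5, 1, 4, 2, 3, 4, 5, 3, 4, 4]
print(score_respondent(answers, negative_items={1, 3}))
print(base_and_range(10))
```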

Now, if we score all our respondents, we should find a distribution of scores between 10 and 50, hopefully with most scores lying somewhere in the middle, and extreme scores being fairly rare. There are many definitions of what a "normal distribution" should look like (and tests galore to determine normality), but for the jobbing analyst, the best definition is a distribution with a bump in the middle and symmetrical fringes at either end. If your distribution doesn't look like that then it might be an accident of sampling, or - the questionnaire itself is flawed (attitude questionnaires usually exhibit a "ceiling effect" - that is, the data tends to get squeezed up into the high end.) Careful questionnaire development should militate against consistent "floor" or "ceiling" effects, so the chances are if you get a skewed distribution from a sample, this is what statisticians would call an accident of your sampling (ie, the software you are measuring is actually considered to be very good by your respondents.)

So you are perfectly entitled to compute an arithmetic mean (average) of your data, over all your respondents. This is your numeric summary. But to be honest, this does not look very user-friendly. So we have to do some transformations to get this numeric summary into a more palatable form. See the next section.

How do you transform numeric data?

There are at least four ways of transforming data into a more palatable form. These are:

  • Straight linear
  • Percentile rank from data
  • z-scores
  • Percentile rank from parameters

Let's tackle each in turn.

Straight linear

This is the simplest kind of transformation but it is really only number magic. The following example shows you how you can transform data from a 10 - 50 distribution to a 0 - 100 distribution for a questionnaire with 10 statements. Please remember that this is not a percentile transformation. It is not a percentage. The number 50 has no specific meaning.
We do the following for each respondent's data, after we have summed it to give us a value we will call X. The process will give us a value we will call T, or transformed value.

We first of all note the base, which is what happens when the respondent has scored 1 for each statement. In our example, base = 10.

We note the range which is the difference between the base and the situation where a respondent has scored 5 for each statement. In our example, range = 50 - 10 = 40.

Suppose the sum of our respondent's statements is 35, that is, X = 35.

    T = ((X - base) / range) * 100 = ((35 - 10) / 40) * 100 = 62.5 
We can check that if X = 10, T comes to zero. And if X = 50, T comes to 100. But do note that we do not have all the numbers between 0 and 100. The difference between Ts when X = 10 and X = 11 is 2.5. That is, we go up the scale from zero to 100 in steps of 2.5.

This slightly embarrassing feature can get neatly hidden if we take the data from many respondents, and average their T values. Hey presto! It looks as if we have a scale that uses all the numbers between zero and 100.
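A minimal sketch of the straight linear transformation, assuming the 10-statement, 5-choice example used above (the function name is invented for this illustration):

```python
def linear_transform(x, n_items=10, low=1, high=5):
    """Rescale a summed score X onto a 0 - 100 scale: T = ((X - base) / range) * 100."""
    base = n_items * low
    score_range = n_items * high - base
    return (x - base) / score_range * 100


print(linear_transform(35))
print(linear_transform(10), linear_transform(50))

# Averaging T over several hypothetical respondents gives in-between values.
sums = [35, 36, 38]
print(sum(linear_transform(x) for x in sums) / len(sums))
```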

Percentile rank from data

Using this method, we scale our data in terms of the percentile rank achieved by each respondent with reference to the collection of respondents we have sampled. Statisticians define a percentile as a score at or below which a given percentage of the rest of the data falls. So if a respondent is right in the middle, their percentile rank is 50. Percentile ranks are easy to understand (who doesn't understand a score of 100%, right?) but they have some undesirable mathematical properties, which I summarise at the end of this section.

It's easy to compute with a spreadsheet if you don't have much data. Let's say you have n items of data, in our case, n = 10. First of all you order your data from lowest value to highest (Col. X, below), and then assign a rank so that the lowest is a rank of 1. It doesn't matter if some of your data items are repeated at this stage (Col. R, below.) Next, if there are repeated data items (as there surely will be!) you assign to each bunch of repeated data items the highest rank in the bunch (Col. R', below.) Finally you compute PR. The example is given for computing the PR of a score of 30 in our sample data set:

  PR  =  R'/n  =  7/10  =  0.7
Percentile ranks are usually expressed as whole numbers between zero and 100, so we multiply the result by 100. Thus a raw score of 30 produces a percentile rank of 70%.
               X     R     R'    PR
               10    1     2     20
               10    2     2     20
               11    3     3     30
               12    4     4     40
               15    5     5     50
               30    6     7     70
               30    7     7     70
               50    8     10    100
               50    9     10    100
               50    10    10    100
One important warning. As you can see, the numeric difference between the percentile ranks will not reflect the actual differences in the corresponding raw data values. So the difference between a raw score of 10 and 11 in percentile rank terms is 10. But so is the difference between a raw score of 12 and 15! In fact your percentiles will vary greatly the more data you collect. This makes percentiles non-linear. One consequence is that computing averages and standard deviations on percentiles is frowned upon by the statistical community. You should compute medians and other rank order statistics instead.
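For those who prefer code to spreadsheets, the ranking procedure above can be sketched like this. The function name is invented, and rounding to a whole number is my reading of "usually expressed as whole numbers".

```python
from statistics import median


def percentile_ranks(scores):
    """Map each raw score to its percentile rank, using the highest rank among ties."""
    ordered = sorted(scores)
    n = len(ordered)
    highest_rank = {}
    for rank, value in enumerate(ordered, start=1):
        highest_rank[value] = rank  # a later (higher) rank overwrites an earlier one
    return {value: round(rank / n * 100) for value, rank in highest_rank.items()}


sample = [10, 10, 11, 12, 15, 30, 30, 50, 50, 50]
ranks = percentile_ranks(sample)
print(ranks)

# Summarise with a median rather than a mean, as advised above.
print(median(ranks[x] for x in sample))
```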

Some analysts prefer to count the number of cases in different percentile ranges, so in our sample above you could say that only 3 / 10 cases (30%) were above the 75th percentile, or were worthy of an "A" rating. The dividing lines are, of course, extremely arbitrary. Let the buyer beware.

If you have a very large collection of data from your questionnaire (counted in the thousands, preferably) you can actually compute a percentile for every value from base to the maximum and use this data as a way of re-expressing your raw data for any one sample just by looking up the table. Although this is better than working from tiny samples, please note that this data too is subject to change as your big reference data set changes. The distribution of the big reference data set will only reflect the conditions under which it was collected.

z-scores

A standard normal score (sometimes called a z-score) is the distance from the mean of a score expressed in units of the standard deviation. All the caveats about the normality of the population distribution from which our sample is obtained apply. None the less, do remember that accidents of sampling can produce some extremely strange sample distributions, and that for the kinds of sample sizes we normally deal with (what statisticians call "small samples", ie, less than 5,000), all tests for "normality of distribution" are actually pretty vague. Either you trust that the questionnaire you are using is capable of producing normally distributed data or you don't. If you don't, don't use this method.

I'm using the same dataset as in the previous section, only this time I've used my spreadsheet to compute the mean (called AVERAGE() in most spreadsheets) and the standard deviation with a divisor of n (usually called STDEVP() in most spreadsheets; the plain STDEV() function uses n - 1 as a divisor instead: please consult an introductory statistics textbook for an explanation of this strange process.) In this example, the MEAN is 26.80 and the standard deviation (or StDevP) is 16.76. Thus the z-score of the raw score of 30 is:

  z  =  (X-MEAN)/StDevP  =  (30-26.80)/16.76   =   0.19
The computation of z-scores for the entire sample is given below. However, z-scores with their means of zero and standard deviations of 1 are not very eye-catching, and so for presentation purposes I advise transforming z-scores into a 50/10 distribution: that is, one in which the mean is 50 and the standard deviation is 10. Do consult a statistics textbook about the implications of this kind of re-scaling. To go from z-scores to this distribution the reverse of the previous formula is used. That is, for a z-score Z = 0.19 (0.191 before rounding) corresponding to a raw score of 30, the transformed 50/10 score, or X', with NEW_MEAN = 50 and NEW_SD = 10 is:
  X'  =  (Z*NEW_SD)+ NEW_MEAN  =  (0.191 * 10) + 50 = 51.91
The worksheet is presented here:
           X        Z     50/10
          10    -1.00     39.97      Mean     =  26.80
          10    -1.00     39.97      StDevP   =  16.76
          11    -0.94     40.57
          12    -0.88     41.17      New Mean =  50
          15    -0.70     42.96      New SD   =  10
          30     0.19     51.91
          30     0.19     51.91
          50     1.38     63.85
          50     1.38     63.85
          50     1.38     63.85

  Mean     26.80     0.00     50.00
  StDevP   16.76     1.00     10.00

Now, if you have a very large database from the questionnaire (counted in the thousands of course) you can consider this as an estimation base and take from it the population parameters: the two parameters of interest being the population mean (PM) and the population standard deviation (PSD.) You are now allowed to use these values instead of your sample mean and sample standard deviation (the StDevP) in computing the value of Z - from which you can compute the value of the 50/10 distribution. This now has the advantage of showing your audience an implicit comparison between the data you have obtained and the data you have accumulated in the "shopping basket" of your estimation base. Does your evaluation show that you are above or below the overall standard? If over 50, then yes, above. If over 60, then very clearly above (see an introductory statistics textbook for how to interpret standard deviations in terms of probabilities.)
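The worksheet, and the variant that borrows its mean and standard deviation from an estimation base, can be sketched together. The function and parameter names are invented for this illustration, and the population values of 25 and 12 in the last line are purely hypothetical. Python's pstdev() divides by n, which is what the worksheet figure of 16.76 corresponds to.

```python
from statistics import mean, pstdev


def z_scores(scores, centre=None, spread=None):
    """Convert raw scores to z-scores.

    With no centre/spread given, the sample mean and the standard deviation
    with a divisor of n are used. Pass population values from an estimation
    base (if you have one) to compare against that base instead.
    """
    centre = mean(scores) if centre is None else centre
    spread = pstdev(scores) if spread is None else spread
    return [(x - centre) / spread for x in scores]


def rescale(z, new_mean=50, new_sd=10):
    """Re-express a z-score on a distribution with the given mean and SD."""
    return z * new_sd + new_mean


sample = [10, 10, 11, 12, 15, 30, 30, 50, 50, 50]
for x, z in zip(sample, z_scores(sample)):
    print(x, round(z, 2), round(rescale(z), 2))

# Hypothetical estimation-base parameters, for illustration only.
print([round(rescale(z), 2) for z in z_scores(sample, centre=25, spread=12)])
```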

I do stress however that the techniques shown in this section will give a false reading if the questionnaire you are using does not have a normal distribution as evidenced by its large estimation base (if the estimation base is in the thousands, then tests for normality of distribution can begin to apply - with some caution.) If you don't have this kind of data my advice is: stick to percentile ranks.

Percentile rank from parameters

The procedure outlined in this section really does depend on the estimation base of the questionnaire exhibiting normal distribution properties. If it hasn't been shown to do so, then this procedure will simply create nonsense.

Instead of computing percentiles from the actual sample, you can convert the sample to z-scores using the population mean and standard deviation from the estimation base. If you then want to convert these to percentile ranks, you use a function usually called NORMSDIST() on most spreadsheets.

If you had a raw score of 30, and the population mean is given as 25 and the standard deviation as 12, the computation of the Population Percentile Rank (PPR) is as follows:

    PPR  =  INT(NORMSDIST((30-25)/12)*100 )  =  66
NORMSDIST produces values between zero and 1, so we have to scale the output up to 100, and then since it is a (population) percentile rank, we truncate it to a whole number.
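Outside a spreadsheet, the same computation can be sketched with the standard normal cumulative distribution written in terms of the error function. The function names are invented here; the mean of 25 and standard deviation of 12 are the example values given above.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution, the job NORMSDIST() does in a spreadsheet."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def population_percentile_rank(x, pop_mean, pop_sd):
    """Percentile rank of raw score x, given population parameters (truncated)."""
    z = (x - pop_mean) / pop_sd
    return int(normal_cdf(z) * 100)


print(population_percentile_rank(30, 25, 12))
```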

All the reservations about percentile ranks computed from samples as stated above apply here as well. Be warned!

Help! Some of my respondents have missed out some questions!

As Oscar Wilde has Lady Bracknell say in his play The Importance of Being Earnest: "To lose one parent, Mr. Worthing, may be regarded as a misfortune; to lose both looks like carelessness." Do try and ensure that your respondents don't miss out questions. In my online questionnaires, every question is obligatory (although I sometimes have freetext questions, usually of a personal nature, that need not be responded to.) I have only once been told that a University's Ethics Committee demanded that respondents be allowed to miss out answering some questions if they pleased. So I relaxed the requirement, exhorted all to fill out everything please - and nobody missed anything anyway.

The reason for this requirement is that if your respondents miss out answering questions the statistical basis of your questionnaire begins to weaken by unknown amounts - and you are really in the dark as to why a question was missed. You can assign a "missing values" code and hope for the best. Another strategy is to copy the sample data to a new file, and replace all missing values with the central (neutral) option. Or you can simply delete any respondent with missing data in the new file. Your call, but remember to keep the original as your "trail of evidence".

In any case, go back to the original, and extract all the items with missing values. Look at them. Are they telling you something about your questionnaire, or your respondents... or whatever you are evaluating?
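The two repair strategies, and the "look at what is missing" step, might be sketched as below. The use of None as the missing-values code, the neutral code of 3, and the function names are all assumptions for this illustration. Both repair functions return new lists, so the original data stays untouched as the trail of evidence.

```python
MISSING = None  # assumed "missing values" code for this sketch


def fill_with_neutral(rows, neutral=3):
    """Return a copy in which every missing answer is replaced by the neutral option."""
    return [[neutral if answer is MISSING else answer for answer in row] for row in rows]


def drop_incomplete(rows):
    """Return a copy containing only respondents with no missing answers."""
    return [list(row) for row in rows if all(answer is not MISSING for answer in row)]


def missing_counts_per_item(rows):
    """Count, for each item, how many respondents skipped it."""
    k = len(rows[0])
    return [sum(1 for row in rows if row[i] is MISSING) for i in range(k)]


original = [
    [5, 4, MISSING, 2],
    [3, 3, 3, 3],
    [MISSING, 1, MISSING, 5],
]
print(fill_with_neutral(original))
print(drop_incomplete(original))
print(missing_counts_per_item(original))  # which items are being skipped?
```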

How many anchors should a questionnaire have?

The little verbal comments above the numbers ("strongly agree", etc.) are what we call anchors. In survey work, where the questions are factual, it is considered a good idea to have anchors above all the response options, and this will give you accurate results. However, the actual wording of the anchors is extremely important, as important as the wording of the questions. People may interpret the wordings of the anchors differently. When my students used to develop questionnaires with anchor points all individually labelled, I would encourage them to go round with each set of anchor points written on a card, and to enquire of potential users (well, all right, other students) whether the anchor points actually made a progression as expected. There were always a few surprises.

In contrast, in opinion or attitude work, you are usually asking a respondent to express their position on a scale of feeling from strong agreement to strong disagreement. Although it would be helpful to indicate the central (neutral) point if it is meaningful to do so, having numerous anchors may not be so important. Indeed, some questionnaires on attitudes have been proposed with a continuous line and two end anchors for each statement. The respondent has to place a mark on the line indicating the amount of agreement or disagreement they wish to express. Such methods have been around a long time, but not taken up by most questionnaire designers.

A related question is, should I include a "no answer" option for each item. This depends on what kind of questionnaire you are developing. A factual style questionnaire should most probably not have a "no answer" option unless issues of privacy are involved. If in an opinion questionnaire, many of your respondents complain about items "not being applicable" to the situation, you should consider carefully whether these items should be changed or re-worded.

In general, I tend to distrust "not applicable" boxes in questionnaires. If the item is really not applicable, it shouldn't be there in the first place. If it is applicable, then you are simply cutting down on the amount of data you are going to get. But this is a personal opinion.

Should I place the positive "Agree" on the left or right?

This is largely a matter of opinion and sometimes corporate policy. But whatever you do, try to be consistent between your questionnaires!

I have always put my positive "Agree" box on the left hand side of the response surface, so the boxes go

[Strongly Agree] [Agree] [Undecided] [Disagree] [Strongly Disagree]
But you may want to do it the other way - why not? It really depends on how your audiences will interpret the act of responding to questions.

I usually place the line with the anchors at the top of the columns of response boxes and make sure the response boxes are all neatly aligned under each word in the anchor line. If you know you will have a page turn, or if your screen will advance to beyond the line where the anchors are given, make sure you repeat the anchors. It is a nice feature to group the questions in tens (if you have that many questions!) and to repeat the anchor line at the start of each block. Take a look at how my SUMI questionnaire is laid out - and see if you can improve on it.

My respondents are continually complaining about my questionnaire items. What can I do?

People always complain. It's a fact of life. And everybody thinks of themselves as a "questionnaire expert." If you get the odd grumble from your respondents, this usually means that the person doing the grumble has something extra they want to tell you, beyond the questionnaire. So listen to them.

If you get a lot of grumbles, this may mean that you have badly miscalculated and it's time to go back to the drawing board. When you listen to people complaining about a questionnaire, listen carefully: are they unhappy about what the questionnaire is attempting to measure, or are they unhappy about the wordings of some of your items?

What other kinds of questionnaires are there?

You mean, what other kinds of techniques can you employ to construct a questionnaire? There are two main other varieties:

  1. Semantic differential type questionnaires in which the user is asked to say where their opinion lies between two anchor points which have been shown to represent some kind of polar opposition in the respondent's mind.
  2. Guttman scaling type questionnaires which are a collection of statements which gradually get more extreme, and you calculate at what statement the respondent begins to answer negatively rather than positively.

Of the two, semantic differential scales are more frequently encountered in practice, although they are not used as much as Likert scales, and professionals seem to have relegated Thurstone and Guttman scaling techniques into the research area. As a footnote, Likert starts his famous article by complaining about how difficult it is to get Thurstone scaling right.

There is also a (third) set of methodologies collected together under the title of Rasch Measurement. This set of methodologies is rarely used in User Experience work, mainly because it involves a lot of intensive statistical computation, and it does not produce easily interpretable results. In contrast, Classical Test Theory as presented here is intuitive to most people (once you get past the tricky business of actually developing the scale!)

Should questions be devised so they are always positive?

The jury (as always) is out on this one with claims and counter-claims. The one thing to avoid most strenuously is to frame questions with an explicit negative. Saying "no" to a negative question involves mental contortions. The reason for having both negative and positive statements in a questionnaire is the fear that response bias will come into play. If all your statements are positive up-beat ones, a respondent can simply check off all the "agrees" without having to consider each statement carefully. So you have no guarantee that they've actually responded to your statements -- they could be working on "auto-pilot". Of course, such questionnaires will also produce fairly impressive statistical reliabilities, but again, that could be a cheat.

However, it has also been mooted that some users will get confused with a reversal of direction and will always hit what they think is the "agree" button even when the statement expresses a negative opinion which they don't want to endorse. I sometimes look at the response patterns of users. If there is a considerable number of responses on the left or the right, irrespective of the question asked, I presume the respondent has got themselves hopelessly lost with regard to the direction of statements. So I usually delete their record (with suitable warnings.) The interesting thing is that this does not happen very often at all in my experience.
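The check on response patterns could be sketched like this. It works on the raw answers before any reversal of negative statements, and only makes sense if the questionnaire mixes positive and negative statements. The 0.9 cut-off for "a considerable number" is an arbitrary assumption, as are the function names.

```python
def one_sided_share(raw_answers, neutral=3):
    """Largest share of raw answers falling on one side of the neutral option."""
    agree_side = sum(1 for answer in raw_answers if answer > neutral)
    disagree_side = sum(1 for answer in raw_answers if answer < neutral)
    return max(agree_side, disagree_side) / len(raw_answers)


def flag_one_sided(rows, threshold=0.9):
    """Return the positions of respondents whose answers nearly all sit on one side."""
    return [i for i, row in enumerate(rows) if one_sided_share(row) >= threshold]


rows = [
    [5, 5, 4, 5, 5, 4, 5, 5, 4, 5],  # always on the "agree" side
    [5, 1, 4, 2, 5, 1, 4, 2, 3, 5],  # follows the direction of the statements
]
print(flag_one_sided(rows))
```

A flagged record is a candidate for a closer look, not an automatic deletion.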

Is a long questionnaire better than a short one? How short can a questionnaire be?

You have to ensure that you have enough statements which cover the most common shades of opinion about the construct being rated. But this has to be balanced against the need for conciseness: you can produce a long questionnaire that has fantastic reliabilities and validities when tested under controlled conditions with well-motivated respondents, but ordinary respondents may just switch off and respond at random after a while. In general, because of statistical artifacts, long questionnaires will tend to produce good reliabilities with well-motivated respondents, and shorter questionnaires will produce less impressive reliabilities but short questionnaires may be a better test of overall opinion in practice.
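One way to see the statistical artifact at work is the Spearman-Brown prophecy formula, which the text above does not name but which is the standard textbook expression of how reliability changes with length. It assumes the added items are comparable to the existing ones, which real items rarely are, so treat the sketch below as an illustration rather than a prediction.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a scale is made `length_factor` times as long
    using comparable items (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Doubling, and then halving, a hypothetical scale with a reliability of 0.70.
print(round(spearman_brown(0.70, 2), 2))
print(round(spearman_brown(0.70, 0.5), 2))
```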

A questionnaire should not be judged by its statistical reliability alone. Because of the nature of statistics, especially the so-called law of large numbers, we will find that what was only a trend with a small sample becomes statistically significant with a large sample. This is as true of the number of respondents you have as it is of the number of questions in your questionnaire. Statistical "significance" is a technical term with a precise mathematical meaning. Significance in the everyday sense of the word is a much broader concept.

So high statistical reliability is not the "gold standard" to aim for?

If a short (say 8 - 10 items) questionnaire exhibits high reliabilities (above 0.85, as a rule of thumb) then you should look at the items carefully and examine them for spurious repetitions. Longer questionnaires (12 - 20 items) if well constructed should yield reliability values of 0.70 or more.

I stress these are rules of thumb: there is nothing absolute about them.

What's the minimum and maximum figure for reliability?

Theoretically, the minimum is 0.00 and the maximum is 1.0. Suspect a questionnaire whose reliability falls below 0.50 unless it is very short (3-4 items) and there is a sound reason to adopt it.

The problem with questionnaires of low reliability is that you simply don't know whether they are telling you the truth about what you are trying to measure or not. It's the lack of assurance that's the problem.

Can you tell if a respondent is lying?

The polite way of saying this is: can you tell if the respondent is giving you "socially desirable" answers? You can, but the development of a social desirability scale within your questionnaire (so-called "lie scale") is a topic all of its own. "Lie scales" work on the principle that if someone is trying to make themselves look good, they will also strongly agree to an inordinate number of statements that ask about impossible behaviours, such as

  • "I have never been late for an appointment in my life."
  • "I always tell the truth no matter what the cost."
Now, some respondents may strongly agree with some of these items but they'd have to be a saint to be able to honestly agree to all of them.

Lie scales just bulk up a questionnaire and are generally not used in HCI. If you are really concerned with your respondents giving you socially desirable answers, you could always put a social desirability questionnaire into the test booklet and look hard at those respondents who give you high scores on social desirability.

Why do some questionnaires have sub-scales?

Suppose that the overall construct you are getting the respondents to rate is complex: there are different components to the construct. Thus for instance, overall user satisfaction is a complex construct that can arise from a number of separate issues, like "attractiveness of product", "helpfulness", "feelings of efficiency" and so on. If you can identify these components, it makes sense to create a number of sub-scales in your questionnaire, each of which is a mini questionnaire in its own right, measuring one component, but which also contributes to the overall construct.

How do you go about identifying component sub-scales?

The soundest way of doing this is to carry out a statistical procedure called factor analysis on a large set of questions, to find out how many underlying (latent) factors the respondents are operating with (see above on factor analysis), but often, received opinion or expert analysis of the overall construct may be used instead. Or even reading the statements carefully in a group discussion activity! The crucial questions are:

  1. Are these factors truly independent? If they are, we would expect the items that make up each factor to be more highly correlated with each other than with items from other factor scales.
  2. What use can the analyst make of the different factors? Extracting a bunch of factors that actually contribute little to our understanding of what is going on is pseudo-science. On the other hand, separating factors which are fairly highly inter-correlated but which make sense to separate out practically makes for a more usable questionnaire. For instance, "screen layout" and "menu structure" are two factors which may be fairly strongly inter-correlated in a statistical sense but separately they may give the analyst useful information about these two aspects of an interface.
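The first question can be checked informally by comparing the average correlation between items on the same sub-scale with the average correlation between items on different sub-scales. The sketch below is in Python and uses a hand-written Pearson correlation; the item names, the ratings from six respondents, and the allocation of items to scales are invented for illustration. It is a quick sanity check, not a substitute for a proper factor analysis.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def within_and_between(item_scores, scale_of):
    """Return (mean within-scale r, mean between-scale r) over all item pairs."""
    within, between = [], []
    for a, b in combinations(sorted(item_scores), 2):
        r = pearson(item_scores[a], item_scores[b])
        if scale_of[a] == scale_of[b]:
            within.append(r)
        else:
            between.append(r)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical ratings: one list per item, one entry per respondent.
ITEM_SCORES = {
    "layout1": [1, 2, 3, 4, 5, 4],
    "layout2": [2, 2, 3, 5, 5, 4],
    "menu1": [5, 1, 4, 2, 3, 3],
    "menu2": [4, 1, 5, 2, 3, 2],
}
SCALE_OF = {"layout1": "layout", "layout2": "layout",
            "menu1": "menu", "menu2": "menu"}

mean_within, mean_between = within_and_between(ITEM_SCORES, SCALE_OF)
print(round(mean_within, 2), round(mean_between, 2))
```

If the within-scale figure is clearly higher than the between-scale figure, the items are at least behaving as the proposed sub-scales suggest.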

How much can I change wordings in a standardised questionnaire?

In general, if a questionnaire has been through the standardisation process the danger in changing, deleting, or adding items is that you undo the statistical basis for the questionnaire: you set yourself back by unknown amounts. You are generally advised not to do this unless you have all the background statistical data and have access to user samples on which you can re-validate your amended version.

There is one general exception. If statements in the questionnaire refer to something like "this system" or "this software" you can usually change these words to refer explicitly to the system you are evaluating without doing too much damage to the questionnaire. For instance:

  • (1)  Using this system gives me a headache.
  • (2)  Using Word-Mate gives me a headache.
Changing (1) to (2) is called focusing the questionnaire and is usually no problem.
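If you keep your items in a file, focusing can be done mechanically. A minimal sketch in Python follows; the function name and the second item are made up for the example, and it assumes the only generic phrases to replace are "this system" and "this software".

```python
import re

# Matches the generic phrases as whole words, in any letter case.
GENERIC_PHRASE = re.compile(r"\bthis (system|software)\b", flags=re.IGNORECASE)


def focus_items(items, product_name):
    """Replace generic references in each item with the product's name."""
    # A function is used as the replacement so that any backslashes in
    # product_name are inserted literally.
    return [GENERIC_PHRASE.sub(lambda match: product_name, item) for item in items]


items = [
    "Using this system gives me a headache.",
    "I would recommend this software to my colleagues.",  # hypothetical item
]
for line in focus_items(items, "Word-Mate"):
    print(line)
```

Read the focused items through afterwards all the same; a mechanical replacement can leave a sentence that is grammatical but odd.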

You may think to do a more radical change of focus, hopefully without affecting the statistical properties too much. Suppose for instance you were to change all occurrences of (3) to (4):

  • (3) using this system...
  • (4) configuring this system...
...you should examine the result very carefully to check that you are not introducing shifts of meaning by doing so. If the questionnaire you are intending to change has an associated database of reference values, then changing the focus like this is most probably not a good idea if you want to still use the database of reference values.

What's the difference between a questionnaire and a checklist?

A checklist is simply a list of statements or features that it may be desirable or undesirable to have. It is not usually a scale in the psychometric sense of the term. A checklist is not amenable to Likert scaling, for instance, so summing the items of a checklist does not make sense. As an example, consider a checklist for landing a plane. You may have 95% of the items checked, but if you haven't checked that the wheels are down, your landing will be eventful. But if you haven't checked that the passengers have put their safety belts on, the consequences may not be nearly as grave (unless of course the wheels are also still up.)

Individual items within a checklist may be averaged across users, so you can get a percentage strength of agreement on each item (thus you are trying to establish truth by consensus) but even then, an expert's opinion may outweigh an averaged opinion of a group of less well informed users (a class of 30 children may decide by vote that a hamster is female, for instance, but an expert may have to over-ride that opinion after a detailed inspection.)
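The per-item percentage of agreement can be computed as in the following minimal Python sketch. The checklist items and the four users' answers are invented for illustration. Note that the result is one percentage per item; the items are deliberately not summed into a total.

```python
def percent_checked(checklists):
    """Return, for each item, the percentage of users who checked it.

    checklists is a list with one dict per user, mapping item -> True/False.
    Every user is assumed to have answered the same items.
    """
    n_users = len(checklists)
    items = checklists[0].keys()
    return {
        item: 100 * sum(user[item] for user in checklists) / n_users
        for item in items
    }


# Hypothetical answers from four users.
USERS = [
    {"undo available": True, "help on every screen": True},
    {"undo available": True, "help on every screen": False},
    {"undo available": True, "help on every screen": False},
    {"undo available": False, "help on every screen": True},
]

print(percent_checked(USERS))
# prints {'undo available': 75.0, 'help on every screen': 50.0}
```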

Where can I find out more about questionnaires?

There are books: books which devote a chapter to Likert scaling and then urge you to go out and try doing a questionnaire yourself. As with my Vergilius, Robert Burton, "I could not choose but to make some little observation... not to scoff or laugh at all, but with a mixed passion." (Anatomy of Melancholy, 1628.) It's good to get started of course, but beware that there's a long road ahead before you get the skills to do the job properly. I have spent over 40 years teaching and constructing questionnaires. As I used to say to my graduate students: you have to earn your bones (alluding to Napier's bones, not the American criminal underworld.)

Here is a minimalist list of reference sources for questionnaire construction that I have found useful as teaching material.

Aiken, Lewis R., 1996, Rating Scales and Checklists. John Wiley & Sons, NY. ISBN 0-471-12787-6. Good general introduction including discussions of personality and achievement questionnaires.

Czaja, Ronald, and Johnny Blair, 1996, Designing Surveys. Pine Forge Press. ISBN 0-8039-9056-1. A useful resource for factual-style surveys, including material on interviews as well as mail surveys.

DeVellis, Robert F., 1991, Scale Development, Theory and Applications. Sage Publications, Applied Social Research Methods Series vol. 26. ISBN 0-8039-3776-8. Somewhat theoretical, but important information if you want to take questionnaire development seriously.

Dillman, Don A., Jolene D. Smyth, and Leah Melani Christian, 2014, Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method, 4th Edition. John Wiley & Sons, NY. ISBN 978-1-118-45614-9. A classic text. Students and I have profited from the earlier editions.

Ghiselli, Edwin E., John P. Campbell, and Sheldon Zedeck, 1981, Measurement Theory for the Behavioural Sciences. WH Freeman & Co. ISBN 0-7167-1252-0. A useful reference for statistical issues. Considered "very readable" by some.

Kline, Paul, 1986, A Handbook of Test Construction. Methuen. ISBN 0-416-39430-2. Practically-orientated, with a lot of good, helpful advice for all stages of questionnaire construction and testing. Some people find it tough going but it is a classic.

Marsden, Peter V., and James D. Wright, 2010, Handbook of Survey Research. Emerald Press, Bingley, UK. ISBN 978-1-84855-224-1. Focus on surveys including topics on sampling, measurement, questionnaire construction, data analysis.

Stecher, Brian M., and W. Alan Davis, 1987, How to Focus an Evaluation. Sage Publications. ISBN 0-8039-3127-1. About more than just questionnaires, but it serves to remind the reader that questionnaires are always part of a broader set of concerns when carrying out an evaluation.

Any comments? How are we doing, so far?

Please don't copy this page since I hope it's going to change over time, but you are very welcome to create a link to it from your site. It would be nice to renew reciprocal links. All of the ones from before 2000 seem to have developed "link-rot". Please email me if you'd like a reciprocal link from here and we can re-start a network. Excerpts may be made from reasonable portions of this page and included in information material so long as my authorship is acknowledged.

If you have any comments, or want to suggest some extra questions or resources, please contact me: jzk@uxp.ie.

Acknowledgements

As always thanks to Murray Porteous for keeping me straight. Dick Miller, Owen Daly-Jones, Cynthia Toryu, Julianne Chatelain, Anne-Mari Flemming, Sean Hammond and Carolyn Snyder have all commented and stimulated. And thank you to my brother in the Lord, Kent Norman, who helped keep the flame alive.
