Free/Libre and Open Source Software:
Survey and Study
Deliverable D18: FINAL REPORT
4A: Survey of Developers -
Annexure on validation and methodology
Rishab A. Ghosh
International Institute of Infonomics
University of Maastricht, The Netherlands
(In response to comments received after the initial publication of the survey report)
Early in the design of the survey methodology we faced the question of sampling: was it possible to survey a representative sample of developers? The question actually has two parts: is it possible to ensure that respondents are developers, and is it possible to identify a sample that is representative of developers based on some filtering criteria?
Our conclusion was that there is insufficient empirical data on OS/FS developers to identify suitable sampling criteria. Without empirical data as a basis, it is not possible to demonstrate that a chosen sample of respondents is representative of developers in general: i.e. one cannot sample developers and know with any confidence that the distribution of nationalities, skill levels, income levels or leadership roles is representative of the distribution in the total (unsampled) population of developers.
Therefore, we decided that in order to obtain empirically valid results, we would have to rely on a random sample. The survey was self-distributing: it was posted to various developer forums and then re-posted by developers to other forums, some of which are listed on the workshop website. The survey announcement was translated into several languages to correct possible biases inherent in an English-language survey. We can state that the survey was seen by a very large number of developers (it was announced on Slashdot, among other places) and that the sample that chose to respond was therefore random, though with some identifiable bias. We are also able to state that the respondents are developers, through the validation process described below.
A self-selection bias is inherent in any survey that is answered voluntarily. There is also a possible bias towards more motivated developers, though not necessarily the more politically motivated. Some nationalities may be over-represented: e.g. France has not previously appeared so high in demographic surveys of OS/FS developers, possibly because the survey announcement was translated into French and distributed on French developer forums. Similarly, some nationalities may be under-represented, especially Asian developers.
One of the problems associated with online questionnaires is that it often cannot be verified whether the respondents really belong to the group under scrutiny. To validate our data in this respect, we therefore used a sub-sample of 487 respondents individually identified as OS/FS developers, comparing their answers to the answers of respondents who could not be individually identified. (Developers were identified by matching the 'e-mail address' field to names found in the source code analysis - see part 5 of the final report - or by matching them to sources in Internet archives.)
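The matching step described above can be sketched as follows. This is a minimal illustration, not the actual FLOSS pipeline: the addresses, respondent records and the helper function are all hypothetical, and only the basic idea (comparing the questionnaire's e-mail field against author addresses extracted from source code) is shown.

```python
# Hypothetical addresses harvested from source code credits
# (part 5 of the final report describes the actual source code analysis).
code_authors = {"alice@example.org", "bob@example.net"}

# Hypothetical survey respondents: serial ID plus the 'e-mail address' field.
respondents = [
    {"id": 1, "email": "Alice@Example.org"},
    {"id": 2, "email": "carol@example.com"},
    {"id": 3, "email": "bob@example.net"},
]

def verified_ids(respondents, code_authors):
    """Return IDs of respondents whose e-mail matches a known code author."""
    return {r["id"] for r in respondents
            if r["email"].strip().lower() in code_authors}

print(sorted(verified_ids(respondents, code_authors)))  # -> [1, 3]
```

Normalising the address (trimming whitespace, lower-casing) before comparison avoids spurious mismatches on case or formatting.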
The validation consisted simply of a comparison of the means and standard deviations of the two groups ('known developers' and other respondents) for a selection of variables in our data set. The comparison is presented in the tables at the end of this annexure, which highlight the few relatively large differences between verified and non-verified responses. It shows that the group of verified OS/FS developers consists of slightly more active and "professionally" experienced persons, but that their answers do not differ significantly from those of the non-verified respondents, especially in terms of orientations and motivations.
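Such a per-variable comparison of means and standard deviations between the two groups can be sketched as follows. This is a toy illustration with hypothetical variable names ('years_active', 'weekly_hours') and invented values, not the actual survey data.

```python
from statistics import mean, stdev

# Toy records: each respondent carries a 'verified' flag plus survey variables.
respondents = [
    {"verified": True,  "years_active": 6, "weekly_hours": 14},
    {"verified": True,  "years_active": 8, "weekly_hours": 10},
    {"verified": True,  "years_active": 5, "weekly_hours": 12},
    {"verified": False, "years_active": 4, "weekly_hours": 11},
    {"verified": False, "years_active": 7, "weekly_hours": 9},
    {"verified": False, "years_active": 3, "weekly_hours": 13},
]

def compare(variable):
    """Return (mean, stdev) per group for one survey variable."""
    groups = {}
    for flag, label in ((True, "verified"), (False, "non-verified")):
        values = [r[variable] for r in respondents if r["verified"] == flag]
        groups[label] = (mean(values), stdev(values))
    return groups

for var in ("years_active", "weekly_hours"):
    print(var, compare(var))
```

Large gaps between the two groups' means (relative to the standard deviations) would flag a variable on which verified and non-verified respondents differ; in the actual study, such gaps were few.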
The whole validation procedure, of course, respected the privacy of the respondents. The first step, identification of the sub-sample, was conducted separately from the main analysis of the survey data. Only the ID-number (a serial number uniquely generated for each respondent) and two variables providing personal features (e-mail address / address fragment) were used for this step; all other data based on the answers given to the FLOSS online questionnaire were excluded from this process. After identification, the data of the sub-sample were anonymised by replacing all personal information with the single attribute "verified" or "not verified". Only after this transformation were the validation data integrated into the data set containing the answers to the online questionnaire. Thus, at no point in the analysis was it possible to assign survey answers to particular persons identified through source code or online archives.
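The anonymisation and merge steps can be sketched as follows. Again a minimal illustration under assumed field names: identification records are reduced to a bare set of IDs before anything touches the survey answers, so no personal information ever enters the analysis data set.

```python
# Output of the separate identification step: ID plus personal fields
# (hypothetical records).
identified = [
    {"id": 101, "email": "alice@example.org", "address": "Maastricht"},
    {"id": 205, "email": "bob@example.net",   "address": "Paris"},
]

# Main survey data set: ID plus questionnaire answers only (hypothetical).
survey = [
    {"id": 101, "answers": {"q1": "yes"}},
    {"id": 205, "answers": {"q1": "no"}},
    {"id": 333, "answers": {"q1": "yes"}},
]

# Step 1: discard all personal information, keeping only the verified IDs.
verified_ids = {rec["id"] for rec in identified}

# Step 2: merge into the survey data as a single 'verified' attribute,
# so answers can no longer be traced back to a person.
for row in survey:
    row["verified"] = "verified" if row["id"] in verified_ids else "not verified"

print([(r["id"], r["verified"]) for r in survey])
```

Because only the ID set crosses between the two steps, an analyst holding the merged data set sees a "verified"/"not verified" flag but no e-mail address or other personal feature.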
Comparative table of verified and non-verified respondents