Open-source statistical software: R and the R Commander

Journal of Modelling in Management

ISSN: 1746-5664

Article publication date: 2 November 2010

Citation

(2010), "Open-source statistical software: R and the R Commander", Journal of Modelling in Management, Vol. 5 No. 3. https://doi.org/10.1108/jm2.2010.29705caa.002

Publisher: Emerald Group Publishing Limited

Copyright © 2010, Emerald Group Publishing Limited


Article Type: Tutorial section From: Journal of Modelling in Management, Volume 5, Issue 3

A “tutorial” section for the Journal of Modelling in Management (JMIM) was suggested by Professor Josef Mazanec with the aim of providing information and practical guidance with issues related to the analysis and presentation of research. It is hoped that this section will fulfil this role and attract interest and contributions from management researchers and practitioners working in a wide variety of research areas.

The objectives of the tutorial section are very broad and include the provision of a forum for papers that deal with the more practical side of modelling. Information about analysis techniques and the practicalities of running analyses is not always included in academic papers, and readers are often left to try and work out the details on their own if they can (often, one does not even know the statistical package that was used to obtain the models or graphics presented). In order to address some of these difficulties, we are planning to include items on software, data coding, recording and security, data analysis, graphics and the presentation of results in this part of the journal. Articles will concentrate on providing practical examples and demonstrations where readers will be able to access the data and the necessary software to recreate the analyses. We also aim to concentrate on explaining the more common analytical techniques and problems that will be of interest to a wide range of researchers rather than new or particularly complex techniques that are likely to have a more limited audience. The general aims and objectives for the tutorial section are as follows:

  • To showcase software that can be used to analyse data and present results.

  • To demonstrate the use of techniques that can be applied to the analysis of social science data and the presentation of results.

  • To provide instruction in the application of specific techniques that have been used in papers published in the JMIM.

  • To provide easy-to-follow example analyses which can be recreated by readers using the data sets and code made available on the web.

  • To broaden the analytical tool-kit of the majority of researchers.

We hope that this section will prove to be popular and useful and that it will become a regular part of the JMIM.

Call for papers

The JMIM invites the submission of articles and examples that illustrate methodological and practical issues associated with data collection, recording, analysis, graphics and presentation. Articles of no more than 5,000 words can be submitted via the journal web site at: www.emeraldinsight.com/jm2.htm

This paper addresses a basic problem when writing tutorials: that of providing examples in a format to which everyone has access. Researchers use many different software packages, the majority of which employ their own methods for coding and saving data and have their own specific statistical analyses, graphics, terminology and procedures. As a major aim of these tutorials is to provide practical examples that all researchers can run on their own machines and adapt for their own data, it is useful to have access to software that can be used by all researchers regardless of the computing platforms they use and the software licences their institutions possess. This tutorial describes one such package that is free to use, is available for multiple computing platforms (Linux, Mac, Windows and Unix) and also has freely available documentation written in many languages.

The software described here is R, an open-source statistical package, largely developed and maintained by academic statisticians. Open-source computing (see www.gnu.org/licenses/gpl.html for a discussion of open source) has a number of advantages when applied to a rapidly developing academic field such as statistical analysis. Because of the large number of contributors and the ease with which techniques can be added to the system, R offers a far greater range of techniques than commercial packages can hope to. Having easy access to a wide range of techniques is vitally important as the statistical methods we employ and the way in which we present our results are often dictated entirely by the software we use. Management research, for example, appears to be dominated by a single commercial statistics package that only offers a fairly restricted set of analyses and graphics, relatively few of which can be considered to be up-to-date. This reliance on one particular software package has undoubtedly impacted on the techniques that are commonly used, particularly by research students, and, it can be argued, has stunted the application and development of appropriate research techniques.

The following tutorial describes how to install R and an easy-to-use graphical user interface (R Commander (Rcmdr)) and also describes some of the many techniques and graphics that can be obtained by using this package.

The basic R system

R is very easy to download and install on a number of platforms. Definitive information about installing R is available from the Comprehensive R Archive Network (CRAN), which can be accessed from the R home page (www.r-project.org). The instructions on this web site are very clear and comprehensive and will not be repeated here. There are also many guides available on the web for installing R, and if you have difficulty, you are advised to consult one or more of these (see, for example, www.Rcommander.com, which provides a quick installation guide).

In addition to installing R on individual computer systems, it is also easy to install on USB drives and CDs (www.Rcmdr.com provides a guide to installing the software onto removable media). Being able to load the software onto removable media and “take it with you” enables users to control their software use and provides true software choice and independence when using unfamiliar or networked systems (in a library, for instance). Being able to distribute working copies of the software on CD (the software will run directly from the CD without any installation onto the host machine) is also of great benefit to teachers and demonstrators who often work in unfamiliar surroundings with limited control of the computer system they are working with.

The R program itself provides a minimal interface (the R-console) from which commands can be issued to load, manipulate, analyse and graph data. Although this interface is very powerful once you get used to it, it can present a barrier to new users, particularly those who are accustomed to dealing with graphical interfaces with pull-down menus. Although not part of the basic installation, a number of graphical user interfaces have been developed for R that can be simply installed (for detailed information about all available graphical interfaces, see www.sciviews.org/_rgui/). Many of these interfaces are designed for specific audiences and research areas (see, for example, Rattle and Brodgar), but there are a number of more general ones (for example, Rcmdr, SciViews-R, JGR and RKward); try them out and see which one suits you the best. We will describe one of these interfaces here, Rcmdr, which is a general interface and one that is simple and also ideally suited to the analysis of social science data.
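To give a flavour of the R-console, the following is a minimal sketch of a short session that loads, summarises, analyses and graphs data. It uses the cars data set that is distributed with R (an illustrative choice, not one made in this tutorial), and the object name fit is arbitrary.

```r
# Load a data set that is distributed with R
# (50 observations of car speed and stopping distance).
data(cars)

# Inspect the data: first rows and summary statistics for each variable.
head(cars)
summary(cars)

# Fit a simple linear regression of stopping distance on speed.
fit <- lm(dist ~ speed, data = cars)
summary(fit)

# Graph the data and add the fitted regression line.
plot(dist ~ speed, data = cars)
abline(fit)
```

Each line can be typed directly at the console prompt; the same commands can also be saved to a script file and re-run later.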

The R Commander

The Rcmdr is written and maintained by John Fox with full information about the project available on his web site (http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/). It is worth reading the aims and objectives for the Rcmdr, as it is only designed to cover techniques from a basic statistics course (with some additions) in a way that is simple and familiar. One of the design goals was to wean users from the GUI to writing commands on the R-console, which users will find essential if they are to get the most from R.

Rcmdr can be easily installed using the pull-down menus or through commands issued in the R-console. Using the pull-down menu option Packages, Install package(s) […], it is simple to install Rcmdr from one of the repositories where the software is stored (all you need is access to the web). Full instructions on how to do this are available on www.Rcmdr.com. Once installed, Rcmdr can be loaded into R via the Packages, Load package […] menu or by typing library(Rcmdr) into the command line on the R-console.
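The command-line route can be sketched as follows. The helper name install_if_missing is hypothetical and is included only to avoid re-installing a package that is already present; the calls to install.packages() and library() are the standard R functions. The installation step is wrapped in interactive() on the assumption that the commands are being run from the R-console, since loading Rcmdr opens a window.

```r
# Install a package from a repository only if it is not already present.
# Returns TRUE if the package is available afterwards.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # requires access to the web
  }
  requireNamespace(pkg, quietly = TRUE)
}

# Only attempt the installation and open the GUI in an interactive session.
if (interactive()) {
  install_if_missing("Rcmdr")
  library(Rcmdr)  # loading the package opens the R Commander window
}
```

Typing install.packages("Rcmdr") followed by library(Rcmdr) at the console achieves the same result without the helper.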

Rcmdr provides an interface to many different techniques and provides the option to install a number of other libraries when it is first loaded. Users should allow the installation of additional libraries (these will be installed to the same drive that R is installed on). Once loaded, Rcmdr provides a simple interface for R, as shown in Figure 1.

Figure 1 Rcmdr simply provides a menu system that enables a selection of techniques to be used via a graphical interface

The Rcmdr interface provides menus for a number of operations that are pretty self-explanatory. Menus are provided for opening and saving script files, editing scripts and outputs, loading, importing and manipulating data, running a selection of statistical analyses and graphics and investigating models and distributions. A full description of these menus can be found in Fox (2005) with an updated version available on his web site: http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/. In general, you should find that Rcmdr is intuitive and very similar to other GUI software. You should, therefore, have little difficulty in using this package to analyse your data.

Additional packages

The basic installations of R and Rcmdr only provide a relatively small selection of the packages that are available; many more can be downloaded and installed onto your computer/USB drive via the Packages, Install package(s) […] menu option or directly from the command line using install.packages(). Packages and libraries increase the functionality of R and Rcmdr by adding analytical techniques, data manipulation and graphical capabilities. Although installing packages is simple, the sheer number available can be daunting, particularly to new users. Currently, there are over 2,500 packages that can be installed from CRAN or one of the other repositories (look at the “Misc” menu option on the R homepage, particularly Bioconductor and Omega in “Related projects”), which makes selecting which packages to install and invest time in learning an important skill. Useful packages are often suggested by other researchers and in books, but it is also important to conduct general searches to identify packages that may be of particular interest to you.

One method to find out what is available is to use the CRAN web site and search for keywords. Go to http://cran.r-project.org/ and select Packages in the left-hand menu. Details of the available packages (around 2,500 at the time of writing) will be presented, and these can be searched for keywords using your browser (for example, search for the keywords “cluster analysis” or “missing data”). Once a package has been identified, it can be investigated in more detail by looking at the manual and/or vignette associated with the package (all packages come with full documentation). For example, a search for “missing” on CRAN identifies 16 different packages that deal with missing data, some via the command line (e.g. mitools and cat) and some with fully developed graphical user interfaces (e.g. Amelia and VIM). As new packages are frequently added, it is useful to search regularly just to “keep up-to-date”.
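A similar keyword search can be run from the R-console over the packages that are already installed on your machine. The sketch below is a local complement to the CRAN search described above, not a replacement for it: it searches only the titles of installed packages, and the helper name find_installed_packages is hypothetical.

```r
# Search the titles of locally installed packages for a keyword.
# installed.packages() reads each package's DESCRIPTION file; the
# "Title" field is requested in addition to the default fields.
find_installed_packages <- function(keyword) {
  pkgs <- installed.packages(fields = "Title")
  hits <- grepl(keyword, pkgs[, "Title"], ignore.case = TRUE)
  data.frame(
    Package = unname(pkgs[hits, "Package"]),
    Title = unname(pkgs[hits, "Title"]),
    stringsAsFactors = FALSE
  )
}

# Example: list installed packages whose title mentions "graphics".
find_installed_packages("graphics")
```

The built-in command help.search("missing") performs a related search over the help pages of installed packages.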

An easy method of identifying R packages is simply to use a search engine: enter the technique you want to use and include the letter R. This often results in a number of suggested packages. For example, entering the search string “data imputation using R” into Google resulted in over 900,000 hits and suggested a number of R packages that can be used to impute missing data (e.g. impute, loess, norm, amelia, ProbABEL, mice; Figure 2).

Figure 2 An internet search for data imputation techniques using R

Another useful resource for identifying packages is the CRAN Task Views (http://cran.r-project.org/web/views/), which group similar packages together and explain which techniques are available. For management research, the task views for the social sciences, econometrics and time series are particularly relevant and include techniques for analysing panel data, data mining, propensity scores and matching and a whole range of structural equation modelling techniques.

 

Rcmdr plug-ins

There are also a number of plug-ins that have been written specifically for Rcmdr which add additional functions to the Rcmdr menu tree. These plug-ins have been specially adapted for Rcmdr and are installed in exactly the same way as all other packages, but they are loaded in Rcmdr using the Tools, Load Rcmdr plug-in(s) […] pull-down menu (Figure 3).
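Plug-ins distributed on CRAN conventionally have names beginning with "RcmdrPlugin." (for example, RcmdrPlugin.survival), so the plug-ins already installed on a machine can be listed from the R-console. The sketch below assumes that naming convention; the helper name list_rcmdr_plugins is hypothetical.

```r
# Return the names of packages that follow the RcmdrPlugin.* naming
# convention. By default the locally installed packages are searched.
list_rcmdr_plugins <- function(pkgs = rownames(installed.packages())) {
  sort(grep("^RcmdrPlugin\\.", pkgs, value = TRUE))
}

# Plug-ins currently installed (an empty result means none are installed).
list_rcmdr_plugins()

# A plug-in is installed like any other package, for example:
if (interactive()) {
  install.packages("RcmdrPlugin.survival")
}
```

After installation, the plug-in still has to be loaded from within Rcmdr using the Tools menu, as described above.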

Figure 3 Loading an Rcmdr plug-in

There are currently 19 plug-ins available for the Rcmdr, and these provide menu selections for techniques such as time-series analysis, advanced factor analysis, meta analysis, survival analysis, quality control, experimental design and teaching demonstrations. The range of analyses available in Rcmdr is now substantial and means that a wide range of analyses can be conducted entirely within the Rcmdr GUI, without resorting to the command line at all (although users should naturally progress to the command line as their skills with using these techniques develop).

Data

It is useful to say something about data at this point, as this has proved to be a source of difficulty for users, particularly those migrating from other packages. Although R can process data saved in a wide variety of formats (see the manuals on data import/export which are available through the R home page), it certainly looks different to many commercial packages, as it does not use a spreadsheet style data frame. Although this can be confusing at first, this approach actually has many advantages, as it enables multiple data sets to be loaded in different formats and allows complex manipulations to be applied to the data as and when required. I will not go into detail here about data coding and manipulation, as this is one of the planned tutorials for the future. I will, however, provide a framework for working with data in R and Rcmdr.

Many data formats (for example, the SPSS .sav format) save information about variables that can be seen in the data (the numbers and labels in the spreadsheet) and also information contained in a number of “hidden” codes that identify missing values and the level at which the data are recorded. These hidden codes can cause some difficulty, particularly when importing and exporting data to other formats. Rather than using filters and complicated data transformation programs in order to exchange data between programs, I recommend that the data are first recorded in an appropriate format, one which explicitly codes all information and also saves to a standard text-based format which makes data accessible to many more packages (see www.Rcmdr.com for a full description). I also recommend that a single “master file” is created, which contains the most accurate record of the data to be analysed. Any data transformations, collapsing of categories and changes to the format of variables (for example, changing a numeric variable to a categorical one in order to run a particular graphic or analysis) are completed on a temporary basis as and when required as part of the analysis. This procedure has the advantages of maintaining the original data, minimising the number of variables, making explicit the type of data collected and alleviating problems associated with changes to the data (when recoded variables are stored permanently, any change to a variable requires every recoded instance of that variable to be changed and any data files that include it to be re-compiled).

I use a procedure where data are entered into a spreadsheet (I use an open-source spreadsheet package, Gnumeric, http://projects.gnome.org/gnumeric/, which has very good data manipulation capabilities and can also be installed and run directly from a USB drive or a CD) and then saved to a comma-separated values (CSV) file. These data can then be loaded into Rcmdr using the Import data option from the File menu. If the data need to be recoded or manipulated in some way (for example, sorting, collapsing categories, re-labelling, transforming, changing numeric codes into category labels, and removing missing data), this should be done in R or Rcmdr without changing the original data set (the recoding commands can be easily saved to script files and reused). At the end of the session, all temporary variables and data files are removed. I find that using this procedure minimises the proliferation of variables and data sets and also reduces confusion caused by multiple instances of files and uncertainty about which particular file was used for particular analyses. There are powerful data manipulation libraries in R and a number of procedures provided in Rcmdr. For more information on manipulating data in R, see the packages “memisc” and “gdata”.
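The master-file procedure can be sketched at the R-console as follows. Everything in the example is hypothetical: the file is written to a temporary location so that the code is self-contained (in practice it would be the CSV file saved from the spreadsheet), and the variable names, codes and cut-points are invented for illustration.

```r
# Create a small, hypothetical master file so the example runs as written.
master_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,gender,income",
             "1,23,1,18000",
             "2,45,2,32000",
             "3,31,2,NA",
             "4,58,1,41000"), master_path)

# Load the master file; it is never modified after this point.
master <- read.csv(master_path, na.strings = "NA")

# All recoding is done on a temporary working copy.
work <- master

# Change numeric codes into category labels.
work$gender <- factor(work$gender, levels = c(1, 2),
                      labels = c("male", "female"))

# Collapse a numeric variable into categories.
work$age_group <- cut(work$age, breaks = c(0, 30, 50, Inf),
                      labels = c("under 30", "30-49", "50+"),
                      right = FALSE)

# Remove cases with missing income, then sort by income.
complete <- work[!is.na(work$income), ]
complete <- complete[order(complete$income), ]

# Keep only the results that are needed.
group_counts <- table(work$age_group)
mean_income <- mean(complete$income)

# Remove the temporary objects at the end of the session.
rm(work, complete)
```

Because the recoding commands live in a script rather than in the data file, they can be re-run against an updated master file without any manual editing of the data.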

Graphics

R and Rcmdr offer a bewildering selection of graphics, many of which are not available in other packages. A great advantage of R is that graphics can be easily manipulated and new graphics can be constructed using one or more of the freely available libraries (for example, ggplot, lattice, grid, vcd). There is not enough room here to go into detail about all the graphics that are available; a good selection can, however, be found on the web at the R graph gallery (http://addictedtor.free.fr/graphiques/) and on Paul Murrell’s web site (www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html). Figure 4 shows some simple graphics that have been produced by the graphic demonstration command “demo(graphics)” (simply enter this onto the command line in the R-console).

Figure 4 Some example graphics

Many of the graphics produced in R are easily manipulated for presentation purposes with full control over font placement and size, labels, symbols and layout. Graphics can be saved to a number of formats including eps, jpg, gif, pdf and png. Graphics may also be saved in pgf format using the tikzDevice library and edited directly using the tikz package or output to a pdf file via LaTeX (those who are familiar with LaTeX may be interested in this and can find additional information at www.texample.net/tikz/examples/).
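As a minimal sketch of this kind of control, the following commands draw a scatterplot with customised symbols, labels and fonts and save it as a pdf file. The cars data set (distributed with R), the file name and the particular colours and sizes are illustrative assumptions; only base R graphics functions are used.

```r
# Hypothetical output location; change this to a folder of your choice.
out_file <- file.path(tempdir(), "scatter.pdf")

# Open a pdf graphics device (width and height are in inches).
# Other devices, such as png() and jpeg(), are used in the same way.
pdf(out_file, width = 6, height = 4)

# Scatterplot with control over symbols, colours, labels and fonts.
plot(cars$speed, cars$dist,
     pch = 19,                    # solid circles
     col = "steelblue",           # symbol colour
     cex = 1.2,                   # symbol size
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance by speed",
     cex.main = 1.1,              # title size
     font.lab = 2)                # bold axis labels

# Add a dashed least-squares regression line.
abline(lm(dist ~ speed, data = cars), lty = 2)

# Close the device; the file is written at this point.
dev.off()
```

The same plotting commands can be redirected to a different format simply by changing the device that is opened before plot() is called.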

Conclusion

Having used and taught many different software packages (these include R, STATA, SAS, SPSS, S-Plus, LISREL and GLIM), I have no hesitation in recommending R to all users, particularly if it is used in conjunction with Rcmdr. The package offers true software freedom and a greater range of techniques than any other statistical software. It is still developing, and I am sure that many more packages will be added to the repositories, with a growing proportion of these offering easy-to-use graphical interfaces. This system also offers the advantage of not providing a passive view of data analysis (i.e. this is the technique to use, and you press these buttons to get the output) and engages the users in wider debates about the techniques they are using. For example, step-wise regression is one technique for model selection that is encouraged uncritically in some software (it is practically the only technique on offer). Users may employ this technique without knowing anything of the controversy surrounding its use. R users, on the other hand, will use many different libraries and packages to analyse their data and are more likely to engage with the wider issues surrounding model selection.

For social science students, the main disadvantage of using R is the difficulty associated with using some of the packages and learning to do things in R that may have been simple using other packages. Although some things do take time to learn and a new way of working may be required, the effort is likely to be worth it in the long run. I believe that the case for using R is overwhelming, and it will become increasingly important to be able to use R even if you still use a commercial package.

Graeme D. Hutcheson, Manchester University, Manchester, UK

Useful internet sites

The R home page: www.r-project.org/

CRAN: http://cran.r-project.org/

CRAN task views: http://cran.r-project.org/web/views/

R Commander home page: http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/

R Commander resources: www.Rcmdr.com

R graph gallery: http://addictedtor.free.fr/graphiques/

Corresponding author

Graeme D. Hutcheson can be contacted at: graeme.hutcheson@manchester.ac.uk

References

Fox, J. (2005), “The R commander: a basic-statistics graphical user interface to R”, Journal of Statistical Software, Vol. 14, p. 9, available at: www.jstatsoft.org/v14/i09/paper

Further Reading

Crawley, M.J. (2007), The R Book, Wiley, Chichester

Murrell, P. (2005), R Graphics, Chapman & Hall, Boca Raton

R Development Core Team (2010), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, available at: www.R-project.org
