Revision History:
17 Feb 2000: removed comparison of un-nestable regression models.
05 Feb 2000: First draft.
R is probably the most awesome statistics package going, if there
is such a thing as an awesome stats program. It can do things that SAS
cannot, it's far more flexible than most stats packages, AND it's free to
boot. It links to databases, runs batch programs well, and has links to
visualization software. Have a look at the R
Project web site for more info.
Most active open source software projects have copious amounts of documentation.
R is no exception. There is the
R-FAQ
as well as a file called R-notes.ps which details much of the basic
R functionality. The program also has considerable online help that you
can access using the help.start() command. Venables and
Ripley have a good book
on the use of S, S+ and R; I had no trouble finding it in the
local university library. And finally, there is also the
R-help mailing list. Subscribe, and read. If at all possible, go
back through the archives to make sure that your question hasn't been asked before.
If you're running a recent Linux machine, then chances are that you already have all of the useful tools you could possibly want. This includes gv or Ghostview (a viewer for postscript files) and ps2pdf (a utility to convert postscript files to PDF files, so that you can share your results with those using Windows).
If you're one of these Windows-using souls, you'll want to find GSview or something similar to read postscript files. If you can afford the Adobe Acrobat program, then by all means use that to convert your graphs to an easier-to-read (for Windows users) format. And very seriously, consider using that old P100 clunker in the lab to run Linux, a window manager like fvwm, and R. If your lab is anything like the lab I was in, you can't really afford the latest and greatest hardware. Fortunately, Linux will run quite happily on hardware that isn't the 'latest and greatest,' so there isn't any excuse not to benefit from the cornucopia of free, quality, open source research software that the Linux world has to offer.
Oh, one last thing. R is developing quickly. Bugs in earlier versions
tend to get found and corrected quickly. It makes sense, therefore, to ensure
that the version of R you are running is as up to date as possible,
which you can check via the R-help mailing list or on the main
R site. If you keep using R, you will quickly realize how easy
it is to compile and install R and the relevant R libraries.
Bolding is used in the normal text to emphasize certain points.
Text written in fixed width font indicates output from the shell
or the R program. Bold fixed width font indicates commands
that I typed in as a user.
What you see below is me starting from my home directory on my Linux machine, and starting the R program using the R command. It should be quite similar for Windows users. Notice, however, that this is happening at the command line, and most of R works at this level. In my experience, most Windows users who 'just want to use the software' often feel uncomfortable using a command line interface. I assure you, however, that the road (however long) is one worth travelling.
[pete@esox pete]$ R
R : Copyright 2000, The R Development Core Team
Version 0.99.0 (February 7, 2000)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type "?license" or "?licence" for distribution details.
R is a collaborative project with many contributors.
Type "?contributors" for a list.
Type "demo()" for some demos, "help()" for on-line help, or
"help.start()" for a HTML browser interface to help.
Type "q()" to quit R.
[Previously saved workspace restored]
>
The 'greater than' arrow is the R prompt. Here you can issue interactive commands to R. Alternatively, you can send job files to R in batch mode...
[pete@esox pete]$ R BATCH job1.R job1.Rout
This is nice because not only will R give you the results, it will also put them below the relevant line of your batch file, so the output file becomes a complete record of both the commands and what they produced.
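As a rough sketch, a batch file is nothing more than the commands you would otherwise type at the prompt. A file like job1.R might contain something along these lines (the data file name is just a placeholder):

# job1.R -- a minimal batch job
# R BATCH runs these lines in order and writes the commands,
# interleaved with their output, to job1.Rout
mydata <- read.table("mydata.txt", header=T)
summary(mydata)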
There are a lot of spreadsheet formats out there. Fortunately, R is compatible with the easiest format for all spreadsheets: comma-separated values (CSV in Excel-speak, ASCII "comma delimited" in Quattro-land). And yes, you *can* have title lines in the first row of the data file (and I STRONGLY encourage you to do just that).
Here's how I read in two comma-delimited files with headers on the top row from the R prompt:
> y <- read.table("rev-work-y3.txt", header=T)
> x <- read.table("rev-work-x3.txt", header=T)
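One caveat: read.table splits columns on whitespace by default, so for a file that really is comma-delimited you will want to add the sep argument (the file name here is just an example):

> lake.data <- read.table("lake-data.csv", header=T, sep=",")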
This normally needs to be repeated every time you start R. However, if you are going to be using the same data frequently, you can have the data held in memory by using the attach function.
> attach(y)
> attach(x)
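The practical effect is that the columns of an attached data frame can be referred to by name alone. For example, using one of the columns in my y dataset, either of these now works:

> mean(y$LAHOD)   # always works
> mean(LAHOD)     # works once y has been attached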
Once this is done, you can keep a bunch of objects (datasets, functions and so on) in your workspace, and R restores them every time you resume a session. The objects function shows you which objects are in memory.
> objects()
[1] "cor.prob" "corr"
"cpx"
"cpy"
"lahod4csub"
[6] "lahodlm" "lahodsub"
"lahodtst" "last.warning" "lvod4csub"
[11] "lvodsub" "nspred"
"nsresp" "nsresp2"
"pred"
[16] "predtp" "resp"
"rslahodlm" "sublahod"
"sublahod4c"
[21] "sublvod" "sublvod4c"
"temp" "x"
"y"
In any case, it is a good idea to check to see that your data did in fact import properly by using the str function. Here's an example of what your imported data could look like...
> str(y)
`data.frame': 93 obs. of 4 variables:
$ LAHOD : num -0.869 -1.254 -1.220 -0.820 -1.161 ...
$ LAHOD4C: num -1.011 -1.354 -1.405 -0.925 -1.265 ...
$ LVOD : num -1.32 -1.64 -1.77 -1.45 -1.79 ...
$ LVOD4C : num -1.47 -1.74 -1.95 -1.55 -1.89 ...
Well, that looked good. All of the data in that dataset is numeric, and R sees it that way. Now I check the second dataset...
> str(x)
`data.frame': 93 obs. of 20 variables:
$ LDOC : num 0.613 0.672 0.556 0.906 0.958 ...
$ LTP : num 0.751 0.924 0.791 1.078 1.139 ...
$ LChlA : num 0.232 0.211 0.084 0.611 0.321 ...
$ LZmax : num 1.20 1.20 1.20 1.30 1.30 ...
$ LDepth : num 0.954 1.000 0.845 0.778 0.845 ...
$ LDSAm : num 6.13 6.13 6.13 6.79 6.79 ...
$ LLSAm : num 5.58 5.58 5.58 5.79 5.79 ...
$ LCALA : num 0.553 0.553 0.553 0.999 0.999 ...
$ LQS1 : num 0.400 0.400 0.400 0.781 0.781 ...
$ LRETENT1 : num 0.342 0.342 0.342 0.405 0.405 ...
$ LHAvT : num 0.941 0.865 1.007 0.874 0.872 ...
$ LLakeVol : num 6.36 6.36 6.36 6.46 6.46 ...
$ LLakeVSA : num 0.778 0.778 0.778 0.668 0.668 ...
$ LHypoVol : num 5.49 5.32 5.76 5.90 5.79 ...
$ LHypoVSA : num 0.455 0.389 0.549 0.625 0.629 ...
$ LHypoThick: num 0.875 0.813 0.978 1.161 1.130 ...
$ LVHVT : num -0.872 -1.035 -0.600 -0.565 -0.666 ...
$ LWRT : num -0.293 -0.293 -0.293 -0.740 -0.740 ...
$ LVTCA : num 0.227 0.227 0.227 -0.331 -0.331 ...
$ LVHCA : num -0.644 -0.807 -0.372 -0.896 -0.997 ...
Notice here that both datasets are 93 records long. I decided to separate my response variables (y) from my predictor variables (x) to make some of my tests easier.
IMPORTANT: Keep in mind that if you are writing batch jobs, you will have to put the read.table statements at the beginning of your job for R to work properly (yes, even if the data files are ALREADY attached).
To remove an item from the objects() list, use the rm function:
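> rm(temp)                    # remove a single object
> rm(list=c("cpx", "cpy"))    # remove several objects at once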
The correlation matrix is frequently one of the first steps in ecological
data exploration. We use the cor function to get the matrix of
R values.
> cor(x)
                   LDOC          LTP       LChlA        LZmax      LDepth
LDOC        1.000000000  0.611361871  0.33407957 -0.260539456 -0.48375717
LTP         0.611361871  1.000000000  0.66288872 -0.201233999 -0.38236143
LChlA       0.334079570  0.662888724  1.00000000  0.045514262 -0.12229689
LZmax      -0.260539456 -0.201233999  0.04551426  1.000000000  0.26827900
LDepth     -0.483757167 -0.382361430 -0.12229689  0.268279000  1.00000000
LDSAm       0.449771802  0.351250434  0.24425539  0.009913158 -0.13082742
LLSAm      -0.120539033 -0.236122925 -0.15651418  0.198032581  0.14629193
LCALA       0.631619617  0.603163940  0.41348646 -0.141413134 -0.26981497
LQS1        0.621806457  0.603657096  0.41216664 -0.130307407 -0.25499195
LRETENT1    0.405044612  0.477506809  0.29981974 -0.029226060 -0.08223476
LHAvT       0.151358019  0.098025379 -0.18985805 -0.661892705 -0.32659125
LLakeVol   -0.294367536 -0.338707562 -0.13812244  0.614303072  0.26977751
LLakeVSA   -0.356056915 -0.205663884  0.06622402  0.867229544  0.29735711
LHypoVol   -0.143434460 -0.091133859  0.05478524  0.754926805 -0.11473921
LHypoVSA   -0.131927106 -0.005798852  0.16115870  0.859036308 -0.04839242
LHypoThick -0.149881840 -0.074237638  0.10047769  0.896063753 -0.08577070
LVHVT      -0.002804946  0.100735371  0.17042304  0.646741955 -0.34374559
LWRT       -0.631619617 -0.603163941 -0.41348647  0.141413134  0.26981497
LVTCA      -0.700468534 -0.638000303 -0.36222809  0.503506714  0.35769036
LVHCA      -0.402204724 -0.293345537 -0.08653418  0.744707852 -0.03853742
For my needs, though, I needed more than this: I also wanted to know the probability that each of these R values was significantly different from 0. Well, I asked, and lo, I received. Open source works, and the proof is that when I sent a request to the R-help list asking how I could get these, someone on the list replied quickly. Dr. Venables' reply is a great explanation of not only how to calculate these probabilities, but also the R code needed to make it work:
cor.prob <- function(X, dfr = nrow(X) - 2) {
  # start with the ordinary correlation matrix
  R <- cor(X)
  # logical mask for the cells above the diagonal
  above <- row(R) < col(R)
  # convert each r to an F statistic, then to a p-value,
  # overwriting the upper triangle of the matrix
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R
}
All you have to do is enter this, line by line, at the R command prompt (the > character). You can then call it like any other function.
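Alternatively, you can save the function in a plain text file and read it in with source (the file name here is arbitrary):

> source("cor-prob.R")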
> cor.prob(x)
                   LDOC           LTP         LChlA        LZmax        LDepth
LDOC        1.000000000  7.568723e-11  1.065100e-03  0.011659249  9.001497e-07
LTP         0.611361871  1.000000e+00  4.539702e-13  0.053086103  1.554049e-04
LChlA       0.334079570  6.628887e-01  1.000000e+00  0.664860799  2.428764e-01
LZmax      -0.260539456 -2.012340e-01  4.551426e-02  1.000000000  9.321101e-03
LDepth     -0.483757167 -3.823614e-01 -1.222969e-01  0.268279000  1.000000e+00
LDSAm       0.449771802  3.512504e-01  2.442554e-01  0.009913158 -1.308274e-01
LLSAm      -0.120539033 -2.361229e-01 -1.565142e-01  0.198032581  1.462919e-01
LCALA       0.631619617  6.031639e-01  4.134865e-01 -0.141413134 -2.698150e-01
LQS1        0.621806457  6.036571e-01  4.121666e-01 -0.130307407 -2.549920e-01
LRETENT1    0.405044612  4.775068e-01  2.998197e-01 -0.029226060 -8.223476e-02
LHAvT       0.151358019  9.802538e-02 -1.898580e-01 -0.661892705 -3.265912e-01
LLakeVol   -0.294367536 -3.387076e-01 -1.381224e-01  0.614303072  2.697775e-01
LLakeVSA   -0.356056915 -2.056639e-01  6.622402e-02  0.867229544  2.973571e-01
LHypoVol   -0.143434460 -9.113386e-02  5.478524e-02  0.754926805 -1.147392e-01
LHypoVSA   -0.131927106 -5.798852e-03  1.611587e-01  0.859036308 -4.839242e-02
LHypoThick -0.149881840 -7.423764e-02  1.004777e-01  0.896063753 -8.577070e-02
LVHVT      -0.002804946  1.007354e-01  1.704230e-01  0.646741955 -3.437456e-01
LWRT       -0.631619617 -6.031639e-01 -4.134865e-01  0.141413134  2.698150e-01
LVTCA      -0.700468534 -6.380003e-01 -3.622281e-01  0.503506714  3.576904e-01
LVHCA      -0.402204724 -2.933455e-01 -8.653418e-02  0.744707852 -3.853742e-02
Lots more lines follow below this, representing the other columns being
treated the same way. Notice that the p-values are now in the 'top' part of
the matrix. Another interesting observation that I didn't make at the time
is that once I entered this function, it was stored as an object in my workspace.
If you go back up and look at the output from the objects
function above, you will see an object called "cor.prob".
Often produced alongside the correlation matrix, the scatterplot matrix is a graphical way of looking for patterns in data. It's also useful for looking for patterns among your predictor variables so that you can decide which ones to remove from your dataset. The command is very simple, but beware: it will fill up a screen very, very quickly if you have a lot of columns in your dataset. I have a 17" monitor on the Linux machine, and running this function on my 20-variable data set more than fills the entire screen.
Here I use a small, four-variable dataset (the y dataset I mentioned earlier):
> pairs(y)
To send the plot to a postscript file instead of the screen, open a postscript device before plotting and close it when you are done (the file name here is just a placeholder):
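> postscript("pairs-y.ps")
> pairs(y)
> dev.off()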
[TODO: somebody needs to write a general online tutorial on graphing.]
Regression is commonly used in ecology to express explanatory or predictive relationships. It helps us understand how one variable is affected by one or more others. In this example, I am interested in seeing what can predict chlorophyll A concentrations in our lakes. Papers like Dillon and Rigler [1974] have already detailed the relationship between total phosphorus (LTP, nutrient) concentration and chlorophyll A (LChlA, algal biomass) concentration. Our Shield lakes are somewhat different from those in the Dillon and Rigler paper, so let's see *how* different.
Here, all my data is in the x dataset, so I will preface my variable names with the data set identifier (x) and then the $ separator.
> summary(lm.chl <- lm(x$LChlA ~ x$LTP))
Call:
lm(formula = x$LChlA ~ x$LTP)
Residuals:
      Min        1Q    Median        3Q       Max
-0.335560 -0.100729 -0.001572  0.108871  0.273484

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.40315    0.08262  -4.880 4.50e-06 ***
x$LTP        0.77338    0.09157   8.446 4.54e-13 ***
---
Signif. codes:  0  `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.1263 on 91 degrees of freedom
Multiple R-Squared: 0.4394,     Adjusted R-squared: 0.4333
F-statistic: 71.33 on 1 and 91 degrees of freedom,      p-value: 4.54e-13
What does this tell us? Looking under the "Pr(>|t|)" column, we see that TP is a significant predictor of chlorophyll concentration in our lakes. (Notice the three *'s at the end of the line: they indicate p < 0.001, or specifically p = 4.54e-13, but in any case it's significant and we don't have to worry much beyond that.) Also notice that the intercept is highly significant.
Note that I could have easily done the above command in two steps:
> lm.chl <- lm(x$LChlA ~ x$LTP)
> summary(lm.chl)
Just use whichever way works for you.
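Since the fitted model is stored in the lm.chl object, you can also pull individual pieces out of it; for example:

> coef(lm.chl)      # just the fitted coefficients
> resid(lm.chl)     # the residuals
> fitted(lm.chl)    # the fitted values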
Knowing that TP is an important predictor of chlorophyll concentrations, we can then ask whether another variable would help predict chlorophyll A concentrations better. Our lakes lie on the Canadian Shield and tend to have high dissolved organic carbon (DOC) concentrations. This brings up the question of whether DOC is a useful additional predictor of chlorophyll A concentrations in these lakes:
> lm.chladoc <- lm(LChlA ~ LTP + LDOC)
Error in eval(expr, envir, enclos) : Object "LChlA" not found
The error here lies in the fact that I didn't specify WHICH dataset the predictor variables were from. Easily rectified...
> lm.chladoc <- lm(LChlA ~ LTP + LDOC, data=x)
> summary(lm.chladoc)
Call:
lm(formula = LChlA ~ LTP + LDOC, data = x)
Residuals:
     Min       1Q   Median       3Q      Max
-0.33431 -0.09577  0.01001  0.10762  0.26063

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.38433    0.08409  -4.571 1.54e-05 ***
LTP          0.85446    0.11551   7.397 6.99e-11 ***
LDOC        -0.12465    0.10857  -1.148    0.254
---
Signif. codes:  0  `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.1261 on 90 degrees of freedom
Multiple R-Squared: 0.4475,     Adjusted R-squared: 0.4352
F-statistic: 36.45 on 2 and 90 degrees of freedom,      p-value: 2.538e-12
From these results, we glean that DOC is not a good additional predictor of chlorophyll A concentrations in our lakes.
Note that I could have also written the above command this way:
> summary(lm.chladoc <- lm(LChlA ~ LTP + LDOC, data=x))
[TODO: change this example to use an existing R dataset so that others can use
the example exactly]
When a number of potentially similar regression lines are created, it is often important to be able to tell whether those lines are in fact significantly different from each other. Going through another example from my data, I have two models explaining oxygen depletion rates in my lakes. The first model uses morphometry (LHypoVSA), nutrients (LTP) and organic matter (LDOC). The second uses morphometry alone. I then test to see whether these models differ.
> summary(lm.lahod2 <- lm(y$LAHOD ~ LHypoVSA + LTP + LDOC, data=x))
Call:
lm(formula = y$LAHOD ~ LHypoVSA + LTP + LDOC, data = x)
Residuals:
     Min       1Q   Median       3Q      Max
-1.52635 -0.07084  0.09872  0.18400  0.39080

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.3833     0.2150  -6.434 6.07e-09 ***
LHypoVSA      0.5782     0.1132   5.109 1.83e-06 ***
LTP           0.7008     0.2871   2.441   0.0166 *
LDOC         -0.6935     0.2722  -2.548   0.0126 *
---
Signif. codes:  0  `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.312 on 89 degrees of freedom
Multiple R-Squared: 0.3045,     Adjusted R-squared: 0.281
F-statistic: 12.99 on 3 and 89 degrees of freedom,      p-value: 4.124e-07
This is in fact a significant model. Each of the variables is significant, along with the intercept, and the overall regression is significant at the 0.05 level or better. What if we use a simpler model...
> summary(lm.lahod3 <- lm(y$LAHOD ~ x$LHypoVSA))
Call:
lm(formula = y$LAHOD ~ x$LHypoVSA)
Residuals:
     Min       1Q   Median       3Q      Max
-1.64386 -0.05429  0.09138  0.16604  0.37745

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.28446    0.05778 -22.231  < 2e-16 ***
x$LHypoVSA   0.62429    0.11514   5.422 4.81e-07 ***
---
Signif. codes:  0  `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.3216 on 91 degrees of freedom
Multiple R-Squared: 0.2442,     Adjusted R-squared: 0.2359
F-statistic: 29.4 on 1 and 91 degrees of freedom,       p-value: 4.806e-07
This too is significant, but the fuller model gives us a slightly better residual standard error and a higher adjusted R-squared. So we expect these two lines to be different, and we test them to make sure:
> anova(lm.lahod2, lm.lahod3)
Analysis of Variance Table
Model 1: y$LAHOD ~ LHypoVSA + LTP + LDOC
Model 2: y$LAHOD ~ x$LHypoVSA
  Res.Df Res.Sum Sq  Df  Sum Sq F value  Pr(>F)
1     89      8.663
2     91      9.414  -2  -0.751  3.8578 0.02473 *
---
Signif. codes:  0  `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Yes, these lines are in fact different from each other at the p<0.05 level.
Thanks to Martin Maechler <maechler@stat.math.ethz.ch> for explaining it so clearly and mailing his explanation to the R-help list.
[TODO: Modify this example so that it uses an existing R dataset]
Installing packages in R is a relatively straightforward procedure. Once you have downloaded the package of interest (in this case I downloaded the leaps package for all-subsets regression from the R site), the first step is to untar it into its own directory.
[pete@esox pkgs]$ tar xvzf leaps_2.0.tar.gz
leaps/
leaps/src/
leaps/src/leaps.f
leaps/src/leapshdr.f
leaps/man/
leaps/man/leaps.Rd
leaps/man/subsets.Rd
leaps/man/plot.subsets.Rd
leaps/man/leaps.setup.Rd
leaps/INDEX
leaps/TITLE
leaps/README
leaps/R/
leaps/R/leaps.R
leaps/R/plot.subsets.R
leaps/DESCRIPTION
The next step is to call R directly with the INSTALL command. Notice that I did not descend into the leaps directory after untarring it. This is important.
[pete@esox pkgs]$ R INSTALL leaps
ERROR: cannot write to or create directory `/usr/local/bin/R/lib/R/library'
Rather than set the permissions on /usr/local/bin/R/lib/R/library so that anyone
can write to it, I decided to become root and install the library that way.
[pete@esox pkgs]$ su root
Password:
[root@esox pkgs]# R INSTALL leaps
Installing package `leaps' ...
libs
g77 -fPIC -g -O2 -c leaps.f -o leaps.o
g77 -fPIC -g -O2 -c leapshdr.f -o leapshdr.o
gcc -shared -o /usr/local/bin/R/lib/R/library/leaps/libs/leaps.so leaps.o leapshdr.o
-L/usr/lib/gcc-lib/i386-redhat-linux/egcs-2.91.66 -L/usr/i386-redhat-linux/lib -lg2c -lm
R
help
>>> Building/Updating help pages for package `leaps'
Formats: text html latex example
DONE (leaps)
DONE (INSTALL)
[root@esox pkgs]#
R then called the relevant compiler (in this case g77, the Fortran 77 compiler) to compile the source, installed the resulting shared library, and updated the help pages for this package.
For most R packages, this is all you have to do. Occasionally, however, when the library points to an external program (such as Xgobi) that you have already installed, you will need to specify where the binary files are located before you run R INSTALL. In the case of the Xgobi library, you'll need to edit two files after you untar the library: these are xgobi.R and xgvis.R located in the xgobi/R/ directory. I installed my copy of Xgobi in /usr/local/bin/
Open the files with a text editor and set the path appropriately; the comments
in each file indicate where to put it.
The leaps package contains functions to carry out all-subsets regression: basically, finding the best possible models to describe or predict a particular variable. It can be an effective way to find the best predictors from amongst a group of variables, and is similar to the RSQUARE selection method in SAS.
In this example, I was looking for the best predictors of oxygen depletion. The first step was to load the leaps library (I had installed it earlier)...
> library(leaps)
> sublahod <- subsets(x, y$LAHOD, nvmax=8, nbest=10)
I was using the contents of the x dataset to regress against
the oxygen depletion variable (y$LAHOD), and I wanted to run
through the best ten models (nbest=10) for each model length,
from one through eight predictor terms (nvmax=8).
> summary(sublahod, matrix.logical=T)
Subset selection object
20 Variables (and intercept)
Forced in Forced out
LDOC FALSE FALSE
LTP FALSE FALSE
LChlA FALSE FALSE
LZmax FALSE FALSE
LDepth FALSE FALSE
LDSAm FALSE FALSE
LLSAm FALSE FALSE
LQS1 FALSE FALSE
LRETENT1 FALSE FALSE
LHAvT FALSE FALSE
LLakeVol FALSE FALSE
LLakeVSA FALSE FALSE
LHypoVol FALSE FALSE
LHypoVSA FALSE FALSE
LHypoThick FALSE FALSE
LCALA FALSE FALSE
LVHVT FALSE FALSE
LWRT FALSE FALSE
LVTCA FALSE FALSE
LVHCA FALSE FALSE
10 subsets of each size up to 9
Selection Algorithm: exhaustive
          LDOC  LTP   LChlA LZmax LDepth LDSAm LLSAm LCALA LQS1  LRETENT1 LHAvT LLakeVol
1  ( 1 )  FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
1  ( 2 )  FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
1  ( 3 )  FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
1  ( 4 )  FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
1  ( 5 )  FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
1  ( 6 )  FALSE FALSE FALSE TRUE  FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
1  ( 7 )  FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
1  ( 8 )  FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    TRUE  FALSE
1  ( 9 )  FALSE FALSE TRUE  FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
1  ( 10 ) FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE TRUE
2  ( 1 )  FALSE FALSE TRUE  FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
2  ( 2 )  FALSE FALSE TRUE  FALSE FALSE  FALSE FALSE FALSE FALSE FALSE    FALSE FALSE
and so on... This function produces a LOT of output. I used this output, particularly the logical selection matrix, to build a bunch of lm calls using a spreadsheet (where IF() and @IF() are your friends).
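For example, if the selection matrix flagged LChlA and LHypoVSA as the best two-variable subset (I am picking those two purely for illustration), the corresponding model fit would be:

> lm.best2 <- lm(y$LAHOD ~ LChlA + LHypoVSA, data=x)
> summary(lm.best2)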
[TODO: I still don't know how to get leaps to do forward stepwise regression.
Can anybody offer suggestions?]
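One possible route, assuming a newer release of the leaps package (where the all-subsets function has since been renamed regsubsets), is its method argument; check the package's help pages before relying on this:

> library(leaps)
> sublahod.fwd <- regsubsets(x, y$LAHOD, nvmax=8, method="forward")  # assumes a newer leaps release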
Visualization of data often takes a LOT of time in ecology, largely because one is forced to make a lot of paper graphs. Being able to explore data interactively on screen not only saves time and trees, but also lets one locate trends in ecological data more effectively. Recently, I found the binaries for the Xgobi program, a free data exploration tool that runs under Linux.
Xgobi's documentation explains how to install it; my only suggestion is to do the installation in the directory where you want it to reside permanently. As I explained above, you will have to edit two source files to indicate where the program is installed, so it just makes sense to put the program directory in a logical place.
I'm just starting to play with Xgobi now, so I don't have a lot of suggestions as to what exactly to do with it, other than to explore and read the documentation that either comes with the program, or some of the papers mentioned on the Xgobi web site.
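If the xgobi library described in the installation section above went in cleanly, launching it on a data frame from inside R should be as simple as this (based on my reading of the package's help pages, so treat it as a starting point):

> library(xgobi)
> xgobi(x)    # brush and spin the 20-variable predictor data set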
[TODO: I suspect that Xgobi could have a whole doc written about it. Any takers?]