Tuesday, April 3, 2012

R Annoyances and Gripes

R is a superb data analysis and graphing program. It has a solid programming environment. It has great docs, although I wish it had even more "see also" and more "examples" in the man pages. Many CRAN packages are great. The R team members and folks answering r-help posts are saints, though some of them can be quite grumpy.

[I am trying to integrate some of the comments from below.  Thanks, everyone.]

Alas, because R is so good, I am probably too tempted to write too many programs in R. Unfortunately, for a generic programming language, R still lacks some sugar.

  1. It should be possible for a user to set an option that forces an immediate abort upon read access to a non-existent list item or data frame column (i.e., one that has not yet been initialized). Instead, R silently yields NULL. Try it!
    .GlobalEnv$misspelled
    
    or even
    .GlobalEnv$misspelled[100] 
    
    do not cause a program abort. Good luck finding the error later.  Gabor also suggests using the .GlobalEnv[["misspelled"]] construct, which is supposed to check (get("misspelled") does stop with an error).  Still, the '$' syntax is often so much nicer....
  2. It should be possible for a user to turn automatic recycling off. I have programmed bugs which I could have caught earlier if the program had bombed on non-conformable assignments. Why do we need the following to work BY DEFAULT?
    v <- 1:4
    w <- 1:8
    z <- v + w
    
    Really? Are you sure it is what you wanted? If it really is, you should set an option first, or better yet, rep() v yourself.  There should be an options(recycle="default"|"scalars-only"|"never").  You can even leave the current default, but allow *me* to turn this off.  Over time, packages may switch their model, and, by default, everything should still just work without changes.
  3. An error should always clearly identify the line in a program where it occurred.
    Simple workaround: turn on
      options(error=recover)
    and run the program again to see where it bombed.
  4. An assert() should be built into the R language. It should be like stopifnot(), but with a cat-style message. So, I could write
    assert( ncol(x)==ncol(y), "Sorry, but ncol(x)=",ncol(x)," != ncol(y)=", ncol(y), "\n")
    The big advantage of building assert into the language, rather than having it user-defined, is that an assert error would bomb showing the correct line, not an error inside the assert function that requires a stack backtrace.  It could even drop you into the right stack frame at that time. The immediate ability to examine the variables associated with the failure is an advantage that currently exists only for the main control logic, not for logic inside functions.

    Programming by contract would be even better. (package testthat has some testing functions, but it's not the same.)   Perhaps a novel pretty syntax, like

    [ ncol(x) != ncol(y) => "Sorry, but", ncol(x), "is not", ncol(y) ]
    

    would be even better, abusing both the '[' when it starts a statement and the '=>'. Such readable and SHORT syntax, paired with a clear message naming the line on which it triggered, would encourage everyone to use a lot of assertions, including at each function start and end.
  5. there is no way for an end user (as opposed to a library writer) to add his own function doc to the set of docs that one can query with '?'.  suggestion: use package.skeleton() for building full packages.  nothing for end users, though.
  6. similar---can we please add something akin to the perl6 pod to R?  please adopt some good features from perl, even though R is of course not perl.  PS: and where/what are the standard filename conventions? I think it is .Rdata for data files. right? (Is it .Rh for R inclusion files?)
  7. nice but unnecessary: Syntactic sugar: would it not be great if '$' inside a string would do a paste-with-autocollapse? How ugly is it to call my function as
    f( paste("msg is '",m,"'", sep="") )
    
    compared to the more readable
    f("msg is '$m'")
    
    ?  but the simpler paste0 goes a long way...
  8. There should be an easier way to assign from a list to multiple objects
    (a,b) <- f()
    
    Gabor G created a great R function that facilitates this, but something like it should be built in.
  9. R should be smart enough to understand one special case to allow quickly assigning to individual elements in a data frame. Believe it or not, but
      bigframe$a[12] <- 12
    
    will actually copy the bigframe object a few times, instead of just replacing the single cell's content. If bigframe is big, you can count the number of assignments per second, even on a 2012 Intel Core system. Yikes! Of course, I know this now, but all novices probably learn this the hard way. They may get the impression that R is slow when they (like I did before I knew) write
      for (i in 1:10000) bigframe$soln[i] <- uniroot(myf, c(-1e6,1e6), i)$root
    
    But
      soln <- numeric(10000)
      for (i in 1:10000) soln[i] <- uniroot(myf, c(-1e6,1e6), i)$root
      bigframe$soln <- soln; rm(soln)
    
    is reasonably fast. (The data.table R package is one way to get around this, but again, this is not what novices would know.)  This is important to fix IMHO.
  10. nice but unnecessary.  I would love to have many of my functions defined at the end after my control logic at the start of my R code. thus, I would love it if I could instruct R to scan the source file for functions first before executing---as it is in perl. This behavior could be enabled by a user and not be the default.
  11. probably a bad idea on my part.  there should be "static" (local persistent) variables in R. Use very sparingly, of course.  I would use it for in-function caching of previously seen/handled cases.
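
Two of the gripes above (the assert with a message, and "static" variables) can be approximated in user code today. The following is only a minimal sketch, not a built-in: the names assert, make_cached_square, and cached_square are made up for illustration, and the assert reports the calling function in its message rather than a true source line.

```r
# Hypothetical helper (not part of base R): like stopifnot(), but with a
# paste0-style message.
assert <- function(cond, ...) {
  if (!isTRUE(cond)) {
    msg <- paste0(...)
    # If assert() was called from inside another function, prefix the message
    # with that call, so the failure points at the caller and not at assert().
    if (sys.nframe() > 1) {
      caller <- sys.call(-1)
      msg <- paste0(deparse(caller)[1], ": ", msg)
    }
    stop(msg, call. = FALSE)
  }
  invisible(TRUE)
}

# A "static" (local persistent) variable emulated with a closure: `cache`
# lives in the enclosing environment and survives between calls.
make_cached_square <- function() {
  cache <- list()
  function(i) {
    key <- as.character(i)
    if (is.null(cache[[key]])) {
      cache[[key]] <<- i^2   # <<- writes into the enclosing environment
    }
    cache[[key]]
  }
}
cached_square <- make_cached_square()
```

Inside a function, assert(ncol(x) == ncol(y), "ncol(x)=", ncol(x), " != ncol(y)=", ncol(y)) stops with a message that names the calling function. It still cannot drop you into the failing frame by itself; options(error=recover) remains the tool for that.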

Some more minor gripes:
  1. load() should have an option to write to stderr what objects are being loaded. in fact, such stderr output should probably be the default to remind the user, just as library loading usually does.  load() should also have an option to check its environment first to see if the object already exists, and load it only when still needed...sort of like a library() invocation.
  2. there should be a legacy option, without which demoted features cause warnings and errors...and let's demote attach and detach.
  3. if
      options(na.rm=TRUE)
    
    then mean(c(1:5,NA)); cor(c(1:5,NA), rnorm(6)) should give values. I have read about na.action, etc, but I could never figure out how to make it work.  lm() ignores NA even silently without any option.  why one and not the other?  of course, it doesn't have to be fully consistent from day 1.  but let's just get it started.  mean, sd, var, median, summary could be fixed.
  4. bad idea on my part.  paste0 is better. There should be
    options(pastechar=" ")
    options(catchar=" ")
    
    to allow setting of the default sep for paste and cat. Incidentally, I like space to be the default to separate the elements in a vector, but I do not like space before and after separate arguments to paste and cat. I can add the latter myself more easily.
  5. The commonly used descriptive summary() for a data object should also include the number of observations, the number of NA's, and the sd as built in; possibly even the T-stat.  I have programmed my own summary(), but this would be a good idea for everyone.
  6. the parallel core library is superb. finally an easy way for me to use the 8 cores in my Mac Pro! however, mclapply does not make it easy to figure out how many times the function was called. will it take another 100 days, or another 100 seconds to finish? tough to guess. a progress counter would help.  Workaround:
    library(parallel)
    counter <- 0
    sleep1sec <- function(i) {
       counter <<- counter + 1
       cat("Counter=", counter, " i=", i, "\n")
       Sys.sleep(1); rnorm(1) }
    mclapply(1:1000, sleep1sec)
    
    but unfortunately each process has its own global counter. still better than no process indicators whatsoever, though.
  7. the two graphics plotting systems should really be replaced by one. I also don't understand the difference between S3 and S4. do we need this?
  8. very minor: sleep() should be a function that sleeps, not a data object that contains how much students slept. Sys.sleep() is what I need. ok, this is a very trivial gripe. but the docs for sleep (?sleep) should say "see also Sys.sleep".
  9. why do I need both source and load ? couldn't these functions detect whether I am reading an .R or an .Rdata file and invoke the right one? both are R objects.
  10. ok.  I need to use the apply family more.  Does anyone have an idea how to abbreviate the common
    mylist <- rnorm(5); for (si in 1:length(mylist)) { value <- mylist[si]; ...
    
    into something that reads quicker and still has a counter?  In R, having a counter is more important than it is in perl, because iterative list creation via push is discouraged relative to list creation via indexed assignment (for speed reasons, I think).  Maybe a new iterator, like
    iterate(si, value, rnorm(5)) { ...
    
    perhaps? I know this is wild syntactic sugar, but then I write user programs and not computer languages, and I would like my programs to be readable and elegant, more than I would like R to make perfect sense.
  11. why don't more options allow mnemonics? for example, options(warn=1) means what? couldn't it be options(warn="immediately"); or, why not text(...,pos="left"), instead of text(...,pos=2)?
  12. why do some functions wrap quotes around variables? For example, why is it library(something), instead of library("something")? something without quotes should be a variable. same thing for select in subset statements. tell me: what is
    d <- data.frame( a=1:3, b=5:7 )
    a <- "b" ; a2 <- "b"
    subset( d, TRUE, select=c(a) )
    subset( d, TRUE, select=c(a2) )
    ok, if it were always optional to omit the quotes, I would understand it. but it isn't the case with a subset-select, in which I want to delete a variable.  subset(d, TRUE, select= -c(x)) works, but subset(d, TRUE, select= -c("x")) does not.  huh?
  13. merge should also be capable of merging by rowname, not just by columns.
  14. Core Team: Make your life easier!  Would it be possible to pseudo-wikify the docs? I.e., allow collaborative suggestions? I often think "this ?... should have included a see also to ...", but there is no easy way for me to fix it right there and then for future users. the existing R bug+suggestion system is ok for bigger issues, of course, but painful for these small items. I also wonder whether my suggestions are actually welcome or a distraction. and, of course, I understand that any short edits should require package maintainer approval in the end. my guess is that many suggestions to the docs would be for the better...and, unlike code itself, docs are something that ordinary users can contribute to. after all, it is they who often need and use the docs the most.
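
A couple of the minor gripes above do have partial answers in base R. The following is a small sketch with made-up data (values, doubled, x, y, and m are illustrative names): seq_along() gives a loop counter without 1:length(), and merge() accepts by="row.names" (or by=0) to join on row names.

```r
# Loop with both a counter and a value. seq_along() is also safe when the
# vector is empty, unlike 1:length(values).
values  <- c(0.5, -1.2, 2.0)          # made-up data
doubled <- numeric(length(values))    # preallocate the result vector
for (si in seq_along(values)) {
  doubled[si] <- 2 * values[si]
}

# merge() can join on row names instead of columns.
x <- data.frame(a = 1:3, row.names = c("r1", "r2", "r3"))
y <- data.frame(b = 5:7, row.names = c("r3", "r1", "r2"))
m <- merge(x, y, by = "row.names")    # key ends up in a column "Row.names"
```

Note that the merged result carries the key in an ordinary column called Row.names rather than as row names, so restoring them takes one more step if you need them.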
Please don't take these as criticisms of the work that the R team has put into R. It is easier for users to gripe than it is to implement features. And obviously I am not willing or able to put these features in myself.

Of course, R has so much magic and so many features that it may well be the case that many of my gripes above already have solutions, but I just don't know them. :-( If you see any, let me know, please. And let me know which gripes I should have complained about but missed, too.


PS: Next, I need to figure out how to create LaTeX output of regressions and data systematically. There are a few packages on CRAN, but I am not yet sure which one I want to use.

PPS: I just discovered Doug Bates' blog on Julia. Interesting. I don't miss what Julia provides (e.g., the JIT) too much. The stuff above bothers me more.