Tuesday, April 3, 2012

R Annoyances and Gripes

R is a superb data analysis and graphing program. It has a solid programming environment. It has great docs, although I wish the man pages had even more "see also" entries and more examples. Many CRAN packages are great. The R team members and the folks answering r-help posts are saints, though some of them can be quite grumpy.

[I have tried to integrate some of the comments from below.  Thanks, everyone.]

Alas, because R is so good, I am probably too tempted to write too many programs in R. Unfortunately, for a generic programming language, R still lacks some sugar.

  1. It should be possible for a user to set an option that forces an immediate abort upon read access to a non-existent list item or data frame column (e.g., one that is misspelled or has not yet been initialized). Instead, R silently yields NULL. Try it!
    .GlobalEnv$misspelled
    
    or even
    .GlobalEnv$misspelled[100] 
    
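    do not cause a program abort. Good luck finding the error later.  Gabor also suggests using the .GlobalEnv[["misspelled"]] construct, which requires the name to be spelled exactly (no partial matching), although it too returns NULL for a name that does not exist.  Still, the '$' syntax is often so much nicer....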
  2. It should be possible for a user to turn automatic recycling off. I have programmed bugs which I could have caught earlier if the program had bombed on non-conformable assignments. Why do we need the following to work BY DEFAULT?
    v <- 1:4
    w <- 1:8
    z <- v + w
    
    Really? Are you sure it is what you wanted? If it really is, you should set an option first, or better yet, call rep() on v yourself.  There should be an options(recycle="default"|"scalars-only"|"never") setting.  You can even leave the current default, but allow *me* to turn this off.  Over time, packages may switch their model, and, by default, everything should still just work without changes.
  3. An error should always clearly identify the line in the program where it occurred.
    Simple fix: turn on
      options(error=recover)
    and run the program again to see where it bombed.
  4. An assert() should be built into the R language. It should be like the stopifnot() function, but with a cat-style message. So, I could write
    assert( ncol(x)==ncol(y), "Sorry, but ncol(x)=",ncol(x)," != ncol(y)=", ncol(y), "\n")
    The big advantage of building assert into the language, rather than having it user-defined, is that a failed assert would report the line of the assert call itself, not an error inside the assert function that requires walking back up the stack.  It could even drop you into the right stack frame at that time. The immediate ability to examine the variables associated with a failure is an advantage that currently exists only for the main control logic, not for logic inside functions.

    Programming by contract would be even better. (package testthat has some testing functions, but it's not the same.)   Perhaps a novel pretty syntax, like

    [ ncol(x) != ncol(y) => "Sorry, but", ncol(x), "is not", ncol(y) ]
    

    would be even better, abusing both the '[' when it starts a statement and the '=>'. Such readable and SHORT syntax, when paired with a clear report of the line on which the assertion triggered, would encourage everyone to use a lot of assertions, including at each function start and end.
  5. There is no way for an end user (not a library writer) to add his own function docs to the set of docs that one can interrogate with '?'.  Suggestion: use package.skeleton() to build a full package.  There is nothing lighter for end users, though.
  6. similar---can we please add something akin to the perl6 pod to R?  please adopt some good features from perl, even though R is of course not perl.  PS: and where/what are the standard filename conventions? I think it is .Rdata for data files. right? (Is it .Rh for R inclusion files?)
  7. nice but unnecessary: Syntactic sugar: would it not be great if '$' inside a string would do a paste-with-autocollapse? How ugly is it to call my function as
    f( paste("msg is '",m,"'", sep="") )
    
    compared to the more readable
    f("msg is '$m'")
    
    ?  but the simpler paste0 goes a long way...
  8. There should be an easier way to assign from a list to multiple objects
    (a,b) <- f()
    
    Gabor G created a great R function that facilitates this, but something like it should be built in.
  9. R should be smart enough to understand one special case to allow quickly assigning to individual elements in a data frame. Believe it or not, but
      bigframe$a[12] <- 12
    
    will actually copy the bigframe object a few times, instead of just replacing the single cell's content. If bigframe is big, you can count the number of assignments per second on an Intel core 2012 system. Yikes! Of course, I know this now, but all novices probably learn this the hard way. They may get the impression that R is slow when they (like I did before I knew) write
      for (i in 1:10000) bigframe$soln[i] <- uniroot(myf, c(-Inf,Inf), i)$root
    
    But
      soln <- numeric(10000)
      for (i in 1:10000) soln[i] <- uniroot(myf, c(-Inf,Inf), i)$root
      bigframe$soln <- soln; rm(soln)
    
    is reasonably fast. (The data.table R package is one way to get around this, but again, this is not what novices would know.)  This is important to fix IMHO.
  10. nice but unnecessary.  I would love to have many of my functions defined at the end after my control logic at the start of my R code. thus, I would love it if I could instruct R to scan the source file for functions first before executing---as it is in perl. This behavior could be enabled by a user and not be the default.
  11. probably a bad idea on my part.  there should be "static" (local persistent) variables in R. Use very sparingly, of course.  I would use it for in-function caching of previously seen/handled cases.
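
A user-level approximation of the assert() from gripe 4 is easy to sketch, although it cannot deliver the main wish (reporting the file and line of the caller). This is only a minimal sketch: the name `assert` and its argument layout are assumptions for illustration, not anything that exists in base R.

```r
# Hypothetical helper; `assert` is not a base R function.
assert <- function(cond, ...) {
  if (!isTRUE(cond)) {
    # paste0() glues the message pieces together without separators;
    # call. = FALSE keeps "Error in assert(...)" out of the message.
    stop(paste0(...), call. = FALSE)
  }
  invisible(TRUE)
}

x <- matrix(0, nrow = 2, ncol = 3)
y <- matrix(0, nrow = 2, ncol = 4)

# tryCatch() is used only so that the example keeps running after the failure.
msg <- tryCatch(
  assert(ncol(x) == ncol(y),
         "Sorry, but ncol(x)=", ncol(x), " != ncol(y)=", ncol(y)),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Without the tryCatch() wrapper the script would simply stop with that message. Finding the offending line still takes traceback() or options(error=recover), which is exactly the part that would need language support.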

Some more minor gripes:
  1. load() should have an option to write to stderr what objects are being loaded. in fact, such stderr output should probably be the default to remind the user, just as library loading usually does.  load() should also have an option to check its environment first to see if the object already exists, and load it only when still needed...sort of like a library() invocation.
  2. there should be a legacy option, without which demoted features cause warnings and errors...and let's demote attach and detach.
  3. if
      options(na.rm=TRUE)
    
    then mean(c(1:5,NA)) and cor(c(1:5,NA), rnorm(6)) should give values. I have read about na.action, etc., but I could never figure out how to make it work.  lm() silently ignores NAs without any option.  Why one and not the other?  Of course, it doesn't have to be fully consistent from day 1, but let's just get it started.  mean, sd, var, median, summary could be fixed.
  4. bad idea on my part.  paste0 is better. There should be
    options(pastechar=" ")
    options(catchar=" ")
    
    to allow setting of the default sep for paste and cat. Incidentally, I like space to be the default to separate the elements in a vector, but I do not like space before and after separate arguments to paste and cat. I can add the latter myself more easily.
  5. The commonly used descriptive summary() for a data object should also include the number of observations, the number of NA's, and the sd as built in; possibly even the T-stat.  I have programmed my own summary(), but this would be a good idea for everyone.
  6. The parallel library in core R is superb. Finally an easy way for me to use the 8 cores in my Mac Pro! However, mclapply does not make it easy to figure out how many times the function has been called. Will it take another 100 days, or another 100 seconds to finish? Tough to guess. A progress counter would help.  Workaround:
    library(parallel)
    counter <- 0
    sleep1sec <- function(i) {
       counter <<- counter + 1
       cat("Counter=", counter, " i=", i, "\n")
       Sys.sleep(1); rnorm(1) }
    mclapply(1:1000, sleep1sec)
    
    Unfortunately, each process has its own global counter. Still better than no progress indicator whatsoever, though.
  7. the two graphics plotting systems should really be replaced by one. I also don't understand the difference between S3 and S4. do we need this?
  8. very minor: sleep() should be a function that sleeps, not a data object that contains how much students slept. Sys.sleep() is what I need. ok, this is a very trivial gripe. but the docs for sleep (?sleep) should say "see also Sys.sleep".
  9. why do I need both source and load ? couldn't these functions detect whether I am reading an .R or an .Rdata file and invoke the right one? both are R objects.
  10. ok.  I need to use the apply family more.  Does anyone have an idea how to abbreviate the common
    mylist <- rnorm(5); for (si in 1:length(mylist)) { value <- mylist[si]; ...
    
    into something that reads quicker and still has a counter?  In R, having a counter is more important than it is in perl, because iterative list creation via push is discouraged relative to list creation via indexed assignment (for speed reasons, I think).  Maybe a new iterator, like
    iterate(si, value, rnorm(5)) { ...
    
    perhaps? I know this is wild syntactic sugar, but then I write user programs and not computer languages, and I would like my programs to be readable and elegant, more than I would like R to make perfect sense.
  11. Why don't more options allow mnemonics? For example, options(warn=1) means what? Couldn't it be options(warn="immediately")? Or, why not text(...,pos="left") instead of text(...,pos=2)?
  12. why do some functions wrap quotes around variables? For example, why is it library(something), instead of library("something")? something without quotes should be a variable. same thing for select in subset statements. tell me: what is
    d <- data.frame( a=1:3, b=5:7 )
    a <- "b" ; a2 <- "b"
    subset( d, TRUE, select=c(a) )
    subset( d, TRUE, select=c(a2) )
    
    OK, if it were always optional to omit the quotes, I would understand it. But that isn't the case with subset's select when I want to delete a variable:  subset(d, TRUE, select= -c(x)) works, but subset(d, TRUE, select= -c("x")) does not.  Huh?
  13. merge should also be capable of merging by rowname, not just by columns.
  14. Core Team: Make your life easier!  Would it be possible to pseudo-wikify the docs, i.e., allow collaborative suggestions? I often think "this ?... should have included a see also to ...", but there is no easy way for me to fix it right there and then for future users. The existing R bug+suggestion system is ok for bigger issues, of course, but painful for these small items. I also wonder whether my suggestions are actually welcome or a distraction. And, of course, I understand that any short edits should require package maintainer approval in the end. My guess is that many suggestions to the docs would be for the better...and, unlike code itself, docs are something that ordinary users can contribute to. After all, it is they who often need and use the docs the most.
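
On gripe 13, merge() can already match on row names: ?merge documents that by = "row.names" (or by = 0) refers to the row names. A small sketch with made-up data frames `x` and `y`:

```r
# Two illustrative data frames that share some row names.
x <- data.frame(a = 1:3, row.names = c("r1", "r2", "r3"))
y <- data.frame(b = c(10, 30), row.names = c("r1", "r3"))

# by = "row.names" (or by = 0) matches rows on their row names.
m <- merge(x, y, by = "row.names")
print(m)
```

The result carries the matched row names in a column called Row.names, and by default only rows present in both inputs are kept (all = TRUE would keep the rest).
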
Please don't take these as criticisms of the work that the R team has put into R. It is easier for users to gripe than it is to implement features. And obviously I am not willing or able to put these features in myself.

Of course, R has so much magic and so many features that it may well be the case that many of my gripes above already have solutions, but I just don't know them. :-( If you see any, let me know, please. And let me know which gripes I missed, too.
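
One gripe with a partial solution is gripe 1: a small wrapper can make reads of missing list items or data frame columns fail loudly. This is a minimal sketch, assuming a hypothetical helper called `strict_get` (not part of base R). For environments such as .GlobalEnv, base R's get() already stops with an error when the object does not exist.

```r
# Hypothetical helper; `strict_get` is not part of base R.
strict_get <- function(x, name) {
  if (!(name %in% names(x))) {
    stop("no element named '", name, "'", call. = FALSE)
  }
  x[[name]]
}

d <- data.frame(a = 1:3, b = 5:7)
print(is.null(d$misspelled))   # plain `$` quietly returns NULL
print(strict_get(d, "a"))

# tryCatch() only keeps the example running after the deliberate failure.
msg <- tryCatch(strict_get(d, "misspelled"),
                error = function(e) conditionMessage(e))
print(msg)
```

It is wordier than '$', which is the whole complaint, but it turns a silent NULL into an immediate error.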


PS: Next, I need to figure out how to create LaTeX output of regressions and data systematically. There are a few packages on CRAN, but I am not yet sure which one I want to use.

PPS: I just discovered Doug Bates' blog on Julia. Interesting. I don't miss what Julia provides (e.g., the JIT) too much. the stuff above bothers me more.

8 comments:

  1. 1. You can use .GlobalEnv[["x"]] and then it must be spelled exactly.
    6. It's part of gsubfn, not base, but try: library(gsubfn); m <- "hello"; fn$f("message is '$m'")
    Also see ?match.funfn
    Minor
    4. gsubfn has had paste0 for years and in R 2.15.0 paste0 is part of R.

  2. Re J: That idiom is not common in R. Your variable "mylist" is not a list in the R sense, because all of its values have the same fundamental type; it is a vector. The correct and common idiom in R for operating on that data is to use vector operations, and if you must loop over it, use an apply or plyr family function.

  3. Here are some thoughts on your gripes above.

    Number 1, this is reasonable, there is already an option to warn if partial matching is used with the dollar sign, adding a warning if there is no match should be similar, and warnings can be promoted to errors. You should put this in a wishlist with explanation of why you think it would be useful.

    Number 2, the auto recycling is a tricky bit that is easily abused, but if we turned it completely off then we would not be able to do something like myvec + 1 without manually recycling the 1, there are also cases where it is used to add a vector to each column of a matrix. It is also nice for doing something like myvec[ c(TRUE,FALSE) ] to get all the odd elements of the vector without needing to first compute the length. Basically while you are correct that it can cause problems, there are enough uses out there that it is not likely to go away.

    Number 4, I don't see what that gains beyond what the stop function already does.

    Number 5, This is what packages are for and with the package.skeleton function it is pretty easy to turn your functions into a package, then you can access the documentation with ?.

    Number 7, consider the sprintf function instead of paste (not quite as nice as the straight interpolation, but gives more control and you don't need all the quotes).

    Number 8, Here I disagree, R is not Perl and is used differently. This functionality would encourage users to use more global variables instead of keeping things together in lists or environments that should be kept together.

    Number 10, or you could put the function definitions into a package, then load the package before running other code. Yes, programs like Perl do this, but Perl and others were not designed to be run interactively, R can be scripted or interactive (or some combination of both), this advantage comes with other limitations.

    Number 11, Do you mean like variables in package Namespaces? or what you get using the local function?

    Number/Letter A, the load function returns the names of the loaded objects invisibly, if you really want them on stderr then wrap the call in message. If you don't want load to overwrite things in the global environment then you can load into an attached environment, or just use attach instead of load to attach the saved file, then you can access the objects, but they won't overwrite things in the global environment.

    Letter B, the legacy option sounds good, put it in a wishlist. While I agree that attach with data frames needs to be discouraged, it is useful for attaching packages or saved workspaces.

    Letter C, part of this is due to the way that R/S evolved, so while it would be nice to be more consistent, there is enough legacy code that uses different argument names that it will probably not change. There is also the problem that functions like cor need more than just a TRUE/FALSE set of options. I personally don't like the idea of a global option for dealing with missing values. I don't want the computer to change my data without me specifically telling it to.

    Letter D, this is not a good idea at the global level, do you really want the output for all the functions that you did not write to also have their output changed when you specify this option? There is now paste0, or you could write your own wrapper that used a lexically scoped variable to change your calls, but not globally.

    Letter J, this is what the apply functions are for. Using sapply does what you specify much more simply than the loop, and it takes care of creating the vector or other object and assigning to the appropriate places in the object/vector.

    Letter L, you can always put the quotes into a library or help call, these are cases where a special shortcut was added for easier interactive calls (and it seems unlikely that someone would load a package by first storing the name in a variable).

    Replies
    1. I wish I had posted this on a wiki, so that this would have been modifiable and extensible (and eventually, a refined version would be nice to go to the R developers). sigh... thanks for your response. some reactions to reactions...


      >> Number 2, the auto recycling is a tricky bit that is easily abused, but if we turned it completely off then we would not be able to do something like myvec + 1 without manually recycling the 1, there are also cases where it is used to add a vector to each column of a matrix. It is also nice for doing something like myvec[ c(TRUE,FALSE) ] to get all the odd elements of the vector without needing to first compute the length. Basically while you are correct that it can cause problems, there are enough uses out there that it is not likely to go away.

      I agree that recycling can be tricky, but this could be a user-settable option, too. I would prefer specifying recycling even with x+1 differently when x is a vector and when x is a scalar. Another option would be to excuse scalars: options(recycle="auto"|"scalars"|"none").


      >> Number 4, I don't see what that gains beyond what the stop function already does.

      as I wrote, better reporting of where the error occurs and, with shorter syntax, encouraging more liberal use.


      >> Number 7, consider the sprintf function instead of paste (not quite as nice as the straight interpolation, but gives more control and you don't need all the quotes).

      yes, I just want syntactic sugar. paste0 helps. easy to live without.


      >> Number 8, Here I disagree, R is not Perl and is used differently. This functionality would encourage users to use more global variables instead of keeping things together in lists or environments that should be kept together.

      agree to disagree.


      >> Number 10, or you could put the function definitions into a package, then load the package before running other code. Yes, programs like Perl do this, but Perl and others were not designed to be run interactively, R can be scripted or interactive (or some combination of both), this advantage comes with other limitations.

      well, then I can just put it at the top, too. this is not an important problem. it would just be nice.

      and, yes, I would love to adopt some great features of perl into R where this is feasible.


      >> Number 11, Do you mean like variables in package Namespaces? or what you get using the local function?

      I admit that this is not too important, at all. should probably have been excluded. (I like keeping counters of how often I have called a function and then emit a heartbeat every once in a while. but this can be done with a uniquely named global. the disadvantage is that now I am taking stuff out of a global environment, rather than keep everything within the function.)

    2. >> Number/Letter A, the load function returns the names of the loaded objects invisibly, if you really want them on stderr then wrap the call in message. If you don't want load to overwrite things in the global environment then you can load into an attached environment, or just use attach instead of load to attach the saved file, then you can access the objects, but they won't overwrite things in the global environment.

      just wanted an option to report to stderr what is loaded, the same way that libraries often report that they have loaded...or even how R reports that it has started.


      >> Letter C, part of this is due to the way that R/S evolved, so while it would be nice to be more consistent, there is enough legacy code that uses different argument names that it will probably not change. There is also the problem that functions like cor need more than just a TRUE/FALSE set of options. I personally don't like the idea of a global option for dealing with missing values. I don't want the computer to change my data without me specifically telling it to.

      agreed. this should not be mandatory. but my code is just littered with na.rm=TRUE and use="pair". of course, lm() already does this without na.rm=TRUE.


      >> Letter D, this is not a good idea at the global level, do you really want the output for all the functions that you did not write to also have their output changed when you specify this option? There is now paste0, or you could write your own wrapper that used a lexically scoped variable to change your calls, but not globally.

      agreed.


      >> Letter J, this is what the apply functions are for, using sapply does what you specify much simpler than the loop and it takes care of the creating the vector or other object and assigning to the appropriate places in the object/vector.

      ok.

      >> Letter L, you can always put the quotes into a library or help call, these are cases where a special shortcut was added for easier interactive calls (and it seems unlikely that someone would load a package by first storing the name in a variable).

      ok.

    3. For number 11 you said that you like to keep counters but would prefer local variables to global for that purpose. The local function allows this, a simple example:

      hbfunc <- local({
        counter <- 0
        function(a,b) {
          counter <<- counter + 1
          if(counter %% 10 == 0) cat('I have done',counter,'additions\n')
          a + b
        }
      })

      Now run hbfunc a bunch of times (with 2 numeric scalar arguments).

      A slightly preferred version might be (eliminates the use of <<-):


      hbfunc2 <- local({
        e <- environment()
        e$counter <- 0
        function(a,b) {
          e$counter <- e$counter + 1
          if( counter %% 5 == 0 ) cat('I have done',counter,'additions\n')
          a+b
        }
      })

      An example of a shared local variable would be something like:


      makecounter <- function(init=0) {
        e <- environment()
        e$counter <- init
        list( inc = function(val=1) (e$counter <- e$counter + val),
              dec = function(val=1) (e$counter <- e$counter - val),
              set = function(val) {old <- e$counter; e$counter <- val; invisible(old)},
              show = function() e$counter )
      }

      one <- makecounter()
      two <- makecounter(100)

      one$inc()
      one$inc()
      two$inc()
      two$show()
      one$dec()

      Hope this helps,

  4. Regarding number 3: I think R should give more information than it currently does, but less than options(error=recover) should be the default. Did you have something particular in mind, i.e. what would you think would be the ideal (feasible) display?

    Replies
    1. simple. It should always display the filename and last line number in the user program where the error occurred.
