Tuesday, April 3, 2012

R Annoyances and Gripes

R is a superb data analysis and graphing program. It has a solid programming environment. It has great docs, although I wish it had even more "see also" and more "examples" in the man pages. Many CRAN packages are great. The R team members and folks answering r-help posts are saints, though some of them can be quite grumpy.

[I try to integrate some of the comments from below.  Thanks, everyone.]

Alas, because R is so good, I am probably tempted to write too many programs in R. Unfortunately, for a general-purpose programming language, R still lacks some sugar.

  1. It should be possible for a user to set an option that forces an immediate abort upon read access to a non-existent list item or data frame column (e.g., because the name is misspelled or the item has not yet been initialized). Instead, R simply silently yields NULL. Try it!
    .GlobalEnv$misspelled
    
    or even
    .GlobalEnv$misspelled[100] 
    
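    do not cause a program abort. Good luck finding the error later.  Gabor also suggests using the .GlobalEnv[["misspelled"]] construct, which is stricter about name matching, though as far as I can tell it can still return NULL for a missing name; get("misspelled") does stop with an error.  Still, the '$' syntax is often so much nicer....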
  2. It should be possible for a user to turn automatic recycling off. I have programmed bugs which I could have caught earlier if the program had bombed on non-conformable assignments. Why do we need the following to work BY DEFAULT?
    v <- 1:4
    w <- 1:8
    z <- v + w
    
    Really? Are you sure it is what you wanted? If it really is, you should set an option first, or better yet, use rep() on v yourself.  There should be an options(recycle="default"|"scalars-only"|"never").  You can even leave the current default, but allow *me* to turn this off.  Over time, packages may switch their model, and, by default, everything should still just work without changes.
  3. An error should always clearly identify the lines in a program where it occurred.
    Simple fix: turn on
      options(error=recover)
    and run my program again to understand where it bombed.
  4. An assert() should be built into the R language---assert should be like the stopifnot() function, but with a cat-like message. So, I could write
    assert( ncol(x)==ncol(y), "Sorry, but ncol(x)=",ncol(x)," != ncol(y)=", ncol(y), "\n")
    The big advantage of building assert into the language, rather than having it user-defined, is that an assert error would bomb showing the correct line, not an error inside the assert function that requires backtracking through the stack.  It could even drop you into the right stack frame at that time.  The immediate ability to examine the variables associated with a failure is an advantage that currently exists only for the main control logic, not for logic inside functions.

    Programming by contract would be even better. (package testthat has some testing functions, but it's not the same.)   Perhaps a novel pretty syntax, like

    [ ncol(x) != ncol(y) => "Sorry, but", ncol(x), "is not", ncol(y) ]
    

    would be even better, abusing both the '[' when it starts a statement and the '=>'. Such readable and SHORT syntax, paired with a clear report of the line on which the assertion triggered, would encourage everyone to use a lot of assertions, including at each function start and end.
  5. There is no way for an end user (as opposed to a library writer) to add his own function docs to the set of docs that one can interrogate with a '?' request.  Suggestion received: use package.skeleton() for building full packages.  Nothing for end users, though.
  6. Similar---can we please add something akin to the perl6 pod to R?  Please adopt some good features from perl, even though R is of course not perl.  PS: and where/what are the standard filename conventions? I think it is .RData for data files, right? (Is it .Rh for R inclusion files?)
  7. Nice but unnecessary: syntactic sugar. Would it not be great if '$' inside a string would do a paste-with-autocollapse? How ugly is it to call my function as
    f( paste("msg is '", m, "'", sep="") )
    
    compared to the more readable
    f("msg is '$m'")
    
    ?  But the simpler paste0 goes a long way...
  8. There should be an easier way to assign from a list to multiple objects:
    (a,b) <- f()
    
    Gabor G created a great R function that facilitates this, but something like it should be built in.
  9. R should be smart enough to understand one special case to allow quickly assigning to individual elements in a data frame. Believe it or not, but
      bigframe$a[12] <- 12
    
    will actually copy the bigframe object a few times, instead of just replacing the single cell's content. If bigframe is big, the number of assignments per second on a 2012 Intel Core system is small enough to count. Yikes! Of course, I know this now, but all novices probably learn this the hard way. They may get the impression that R is slow when they (like I did before I knew) write,
      # myf and the interval are placeholders; uniroot() needs finite end points
      for (i in 1:10000) bigframe$soln[i] <- uniroot(myf, c(-1e3, 1e3), i)$root
    
    But
      soln <- numeric(10000)
      for (i in 1:10000) soln[i] <- uniroot(myf, c(-1e3, 1e3), i)$root
      bigframe$soln <- soln; rm(soln)
    
    is reasonably fast. (The data.table R package is one way to get around this, but again, this is not what novices would know.)  This is important to fix IMHO.
  10. Nice but unnecessary.  I would love to have many of my functions defined at the end of my R code, after the control logic at the start. Thus, I would love it if I could instruct R to scan the source file for functions first before executing---as perl does. This behavior could be enabled by the user and need not be the default.
  11. Probably a bad idea on my part.  There should be "static" (local persistent) variables in R. Use very sparingly, of course.  I would use them for in-function caching of previously seen/handled cases.
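
Gripes 1, 2, and 4 can be approximated at user level today, though without the correct-line reporting that language support would give. What follows is only a sketch under that assumption: strict_get, strict_add, and assert are hypothetical names that I made up, not functions in base R or in any package I know of.

```r
# Hypothetical helpers; none of these names exist in base R.

# Strict lookup: stop instead of silently yielding NULL (gripe 1).
strict_get <- function(x, name) {
  present <- if (is.environment(x)) {
    exists(name, envir = x, inherits = FALSE)
  } else {
    name %in% names(x)
  }
  if (!present) stop("no element named '", name, "'", call. = FALSE)
  x[[name]]
}

# Addition that recycles scalars only (gripe 2, "scalars-only" mode).
strict_add <- function(v, w) {
  if (length(v) != length(w) && length(v) != 1L && length(w) != 1L) {
    stop("lengths differ: ", length(v), " vs ", length(w), call. = FALSE)
  }
  v + w
}

# stopifnot() with a pasted message (gripe 4). stop() concatenates its
# arguments, and call. = FALSE keeps "assert(...)" out of the message.
assert <- function(cond, ...) {
  if (!isTRUE(cond)) stop(..., call. = FALSE)
  invisible(TRUE)
}
```

With these, strict_get(.GlobalEnv, "misspelled") and strict_add(1:4, 1:8) both stop with an error, and assert(ncol(x) == ncol(y), "ncol(x)=", ncol(x), " != ncol(y)=", ncol(y)) reads close to what I asked for. The error still originates inside the helper, so the line-number complaint stands.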

Some more minor gripes:
  1. load() should have an option to write to stderr what objects are being loaded. In fact, such stderr output should probably be the default to remind the user, just as library loading usually does.  load() should also have an option to check its environment first to see if the object already exists, and load it only when still needed...sort of like a library() invocation.
  2. There should be a legacy option, without which deprecated features cause warnings and errors...and let's deprecate attach and detach.
  3. If I set
      options(na.rm=TRUE)
    
    then mean(c(1:5,NA)) and cor(c(1:5,NA), rnorm(6)) should give values. I have read about na.action, etc., but I could never figure out how to make it work.  lm() even ignores NAs silently without any option.  Why one and not the other?  Of course, it doesn't have to be fully consistent from day 1.  But let's just get it started.  mean, sd, var, median, and summary could be fixed.
  4. Bad idea on my part;  paste0 is better. There should be
    options(pastechar=" ")
    options(catchar=" ")
    
    to allow setting of the default sep for paste and cat. Incidentally, I like space to be the default to separate the elements in a vector, but I do not like space before and after separate arguments to paste and cat. I can add the latter myself more easily.
  5. The commonly used descriptive summary() for a data object should also include the number of observations, the number of NA's, and the sd as built in; possibly even the T-stat.  I have programmed my own summary(), but this would be a good idea for everyone.
  6. The parallel core library is superb. Finally an easy way for me to use the 8 cores in my Mac Pro! However, mclapply does not make it easy to figure out how many times the function was called. Will it take another 100 days, or another 100 seconds to finish? Tough to guess. A progress counter would help.  Workaround:
    library(parallel)
    counter <- 0
    sleep1sec <- function(i) {
       counter <<- counter + 1
       cat("Counter=", counter, " i=", i, "\n")
       Sys.sleep(1); rnorm(1) }
    mclapply(1:1000, sleep1sec)
    
    But unfortunately each process has its own global counter. Still better than no progress indicator whatsoever, though.
  7. The two graphics plotting systems should really be replaced by one. I also don't understand the difference between S3 and S4. Do we need both?
  8. Very minor: sleep() should be a function that sleeps, not a data object that contains how much students slept. Sys.sleep() is what I need. OK, this is a very trivial gripe. But the docs for sleep (?sleep) should say "see also Sys.sleep".
  9. Why do I need both source and load? Couldn't these functions detect whether I am reading an .R or an .RData file and invoke the right one? Both are R objects.
  10. OK.  I need to use the apply family more.  Does anyone have an idea how to abbreviate the common
    mylist <- rnorm(5); for (si in 1:length(mylist)) { value <- mylist[si]; ...
    
    into something that reads quicker and still has a counter?  In R, having a counter is more important than it is in perl, because iterative list creation via push is discouraged relative to list creation via indexed assignment (for speed reasons, I think).  Maybe a new iterator, like
    iterate(si, value, rnorm(5)) { ...
    
    perhaps? I know this is wild syntactic sugar, but then I write user programs and not computer languages, and I would like my programs to be readable and elegant, more than I would like R to make perfect sense.
  11. Why don't more options allow mnemonics? For example, options(warn=1) means what? Couldn't it be options(warn="immediately")? Or, why not text(...,pos="left"), instead of text(...,pos=2)?
  12. Why do some functions implicitly wrap quotes around their arguments? For example, why is it library(something), instead of library("something")? something without quotes should be a variable. Same thing for select in subset statements. Tell me: what do these two return?
      d <- data.frame( a=1:3, b=5:7 )
      a <- "b" ; a2 <- "b"
      subset( d, TRUE, select=c(a) )
      subset( d, TRUE, select=c(a2) )
    OK, if it were always optional to omit the quotes, I would understand it. But that is not the case with a subset select in which I want to delete a variable:  subset(d, TRUE, select= -c(a)) works, but subset(d, TRUE, select= -c("a")) does not.  Huh?
  13. merge should also be capable of merging by rowname, not just by columns.  (As it turns out, it is: by="row.names" or by=0 does this.)
  14. Core Team: Make your life easier!  Would it be possible to pseudo-wikify the docs? I.e., allow collaborative suggestions? I often think "this ?... should have included a see also to ...", but there is no easy way for me to fix it right there and then for future users. The existing R bug+suggestion system is OK for bigger issues, of course, but painful for these small items. I also wonder whether my suggestions are actually welcome or a distraction. And, of course, I understand that any short edits should require package maintainer approval in the end. My guess is that many suggestions to the docs would be for the better...and, unlike code itself, docs are something that ordinary users can contribute to. After all, it is they who often need and use the docs the most.
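
A few of the minor gripes have answers in base R already. The sketch below uses only documented base functions and arguments (seq_along, the na.rm and use arguments, and merge with by="row.names"); the data and the variable names out, avg, r, x, y, and m are made up for illustration.

```r
# Gripe 10: seq_along() supplies the counter and, unlike 1:length(x),
# produces an empty loop when the vector is empty.
mylist <- rnorm(5)
out <- numeric(length(mylist))        # indexed assignment, no "push"
for (si in seq_along(mylist)) {
  value <- mylist[si]
  out[si] <- value^2
}

# Gripe 3: NA handling has to be requested per call, not via options().
avg <- mean(c(1:5, NA), na.rm = TRUE)
r <- cor(c(1:5, NA), rnorm(6), use = "complete.obs")

# Gripe 13: merge() matches on row names with by = "row.names" (or by = 0).
x <- data.frame(a = 1:3, row.names = c("r1", "r2", "r3"))
y <- data.frame(b = 5:7, row.names = c("r3", "r1", "r2"))
m <- merge(x, y, by = "row.names")    # row names come back as a Row.names column
```

None of this removes the underlying complaint that the defaults and names are inconsistent from function to function; it just shows where the knobs currently are.
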
Please don't take these as criticisms of the work that the R team has put into R. It is easier for users to gripe than it is to implement features. And obviously I am not willing or able to put these features in myself.

Of course, R has so much magic and so many features that it may well be the case that many of my gripes above already have solutions, but I just don't know them. :-( If you see any, let me know, please. And let me know what gripes I should have complained about but missed, too.
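
One example of a gripe with an existing answer is the wish for "static" variables: a closure keeps local state alive between calls. This is a minimal sketch, assuming a one-argument function whose argument can serve as a cache key; make_cached, slow_square, and fast_square are hypothetical names chosen for illustration.

```r
# make_cached() is a hypothetical helper, not part of base R.
# The environment of the enclosing call holds `cache`, which survives
# between calls, much like a C "static" local variable.
make_cached <- function(f) {
  cache <- list()
  function(key) {
    k <- as.character(key)
    if (is.null(cache[[k]])) {
      cache[[k]] <<- f(key)   # '<<-' updates the persistent copy
    }
    cache[[k]]
  }
}

# Usage: the expensive function runs only once per distinct argument.
slow_square <- function(i) { Sys.sleep(0.1); i^2 }
fast_square <- make_cached(slow_square)
fast_square(4)
fast_square(4)   # answered from the cache, no second sleep
```

A result of NULL is never cached in this sketch, because assigning NULL to a list element removes that element. As with the counter in the mclapply workaround, each forked process would get its own copy of the cache.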


PS: Next, I need to figure out how to create LaTeX output of regressions and data systematically. There are a few packages on CRAN, but I am not yet sure which one I want to use.

PPS: I just discovered Doug Bates' blog on Julia. Interesting. I don't miss what Julia provides (e.g., the JIT) too much. The stuff above bothers me more.

Sunday, May 15, 2011

What Android tablets could do better than the iPad

How can Android beat the iPad 2?  Tough.  The iPad install base is large, and the iPad is simply a great product.  But the iPad 2 is not perfect.  In any case, if Android does not offer a better product, it will never catch on.  So, from the perspective of an adult user (i.e., not for primary use as a gaming machine), what could an Android tablet do better than the iPad 2, which would lead me to trade mine in?

  1. Better cameras.  The iPad 2 cameras are awful, even for skype.  I don't mean (just) the resolution.  I mean the lens angle.  It is not wide enough.  You have to hold the iPad about 4 feet away from your face in order not to look like a moon face.  It also means that the inevitable shaking makes the picture worse.
  2. Better software developers.  For example, tap into better system software through outside ideas.  Allow awesome third-party system software to take control of the tablet--but only if it has been carefully checked and vetted.  (For example, I would buy a tablet that allowed me to set up a "point system" for my kids: if you do educational games for x hours, you get to play any kind of game for y hours.  No one other than Apple can implement this on the iPad.)  Offer a venue for Android developers.  Oh, and build faith among your developers.  I don't mean being static in the sense of always sticking religiously to legacy interfaces.  See, most external Apple developers do not trust Apple, and rightly so.  They understand that Apple may pull the rug out from under them at any moment if Apple finds it in its own interest.  Apple's predatory behavior vis-a-vis its developers is Android's single biggest asset, and Apple's single biggest weakness.  If Android were just competitive, every developer would prefer placing their stakes with Android, and not with iOS.
  3. Better web browsing.  Safari sucks.  I want real tabs.  Background loading, even if I exit it.  (Flash is not half as important as the basic experience.)  Since I am at it, make sure the other base software for adults is better, too.  Offer a better skype and email client than what exists on the iPad.  (Skype, where is the iPad client?  Why am I running an iPhone skype??)  Lure all the magazine and book publishers to go to Android.  Work with Amazon.
  4. Offer some better technology.  Take some risks to lock up something unique.  Offer a truly foldable tablet, where the fold is seamless.  Or a flexible tablet.   Or a sun-viewable tablet. Wireless charging. (Retina display?  Who cares.  The iPad is plenty readable, even today.  The resolution and battery life battles, like the CPU battles of old, are mostly over.  1000x1000 on a 10" tablet is decent.  8 hour battery life is decent.  Yes, you can be better, but this is not what will make or break the next system.)   There are plenty of better tech solutions that have failed.  But there are few solutions without a compelling advantage that were able to overtake a market leader.  Apple is a smart leader.  They learned how important market share is while fighting Intel and Microsoft.  Apple now has the iOS software base.  They have the user base.  Business as usual just won't work for Android.  Android just has to become better than the iPad, or it will never catch on.
  5. Offer an optional "pen" mode for more accurate drawing with a stylus.  Offer voice recognition software deeply embedded in the system.  

Of course, the whole tablet experience has to be right, too.  Be at least as good wherever you can be.  The device should be as thin and nice (and crapware-free) as the iPad.

PS: This is also why Google TV failed---a TV has to be simple and integrated.  A TV that has everything seamlessly integrated, without cables, complex menus, etc.  The DVD, Blu-ray, DVR, etc., all seamless.  One remote control.  My grandmother would have to be able to operate it.

The Secret To Making Great Movies

Sadly, there is no secret to making great movies.

The "secret" is having a good story.  A story that is interesting.  A story in which it is not obvious what will happen next, yet you can hardly wait to see what will happen next.  A story in which everything makes sense (in the end).  A story which is believable.  A world.

This is why good theater works, even though it is on a small stage in an obviously unbelievable setting.  The story must be engaging.  Think "The Usual Suspects."  Or "The Lives of the Others."  Or "No Country for Old Men."  Or "Snatch."  Or "The Wire."  Or "Downton Abbey."  Or many other movies and series that did not cost an arm and a leg to make.

You don't need a great director, great actors, high production budgets, or special effects.  Yes, these can help.  But a great story will make mediocre "everything else" appear great.  A boring story will make great "everything else" appear mediocre.

Of course, some stories may intrinsically require a great director, great actors, huge production budgets, or great special effects.  For Blade Runner, the feeling of future LA was vital.  For the series Rome, a believable Rome 2000 years ago was vital.  For the liquid metal robot in Terminator II, the special effect was vital.  (But note that Terminator I, which is just as good, was made on a shoestring.)  For Lawrence of Arabia, how could you film this, if not in the desert with hundreds of actors and extras?   For the Godfather, it had to be Francis Ford Coppola.  And Al Pacino.  And...

But, in the end, nothing other than a good story really matters.

So, why does Hollywood--and, worse, network television--produce so much shit?  It's because Hollywood is not out to make great movies.  It's out to sell movies.  If movies like "Independence Day" and reality TV sell, then this is what will be produced.

Of course, I think that Hollywood is also too short-sighted.  Making a good-story movie is a larger risk than making "The Matrix 5" or "Spiderman 8."  But, a new world with a new story can itself create more spinoffs.  Of course, even a good film (like Rocky 1) will then warp into a bad one (like Rocky 14), but I can live with this.

And, of course, convincing the folks providing the cash is easier said than done...

Wednesday, January 19, 2011

(La)TeX Advantages and Disadvantages


Advantages

  • In wide use.  Universal.  Free.  Easy to install everywhere.
  • Good Infrastructure (emacs support, typesetting stage, pdf output; many users)
  • Beautiful output
  • The LaTeX companion book
  • The only structured text-based word processing system (in wide use?) with good math support
  • "Easy typing" oriented---XML is painful
  • auctex support in emacs with nice highlighting 
  • Mature --- no bugs in base.  few if any bugs in packages
  • Many, many packages on ctan 
  • Many helpful souls on comp.text.tex
  • Defining simple user macros is easy

There simply is no alternative with its feature set.  (There are alternatives for smaller tasks, e.g., WYSIWYG for letters, etc.)

Disadvantages

  • Based on an ancient macro language, which only very few wizards still understand.  When they disappear, TeX will die.  It is already happening slowly.
  • Insane syntax (or shall we say non-syntax).  User documents could instead be sanely defined with a grammar, while keeping the TeX bowels hidden.  User text is difficult to parse: it is not clear how many arguments each command has (see below).
  • Weird catcodes
  • Incomprehensible error messages
  • No easy-to-understand end user programming language for complex tasks
  • No clear hooks for external programs (e.g., a preprocessor)
  • No stdin support
  • Poor namespace in macros (no '.', digits, '_' in macro names)
  • Strange meta characters.  '%' for comment, '#' for macro parameters.  $ for math---why not \m{math} to free up the $ sign for what it is?
  • No clear separation of content and markup
  • Syntax changes require knowledge of both emacs auctex and the tex macro language, neither of which is very easy to learn.  Error messages would have to be sane.  Usage would have to be wide.
  • Painful font installation  (No "drop font here and it will work.")
  • No multithreading and multiprocessor support (even for multiple \include{} files)
  • No definitive authority.  Knuth has pretty much abandoned it, and TUG does not have the resources (nor does it believe that it has the right) to abandon old and obsolete features.  It's as if Larry had disappeared and perl remained stuck forever at version 1 or version 2.

Unfortunately, ConTeXt is not ready (we spent a year trying to get a complex book to typeset, but ultimately gave up).  ConTeXt is also based on the messy TeX macro language.  Thanks to Hans Hagen and his team for pushing the envelope, though.


Small Syntax Gripes

  • $ should be the dollar symbol.  It is common enough.
  • Math should be typeset with \[ \] or \m{ math }, instead of '$$' and '$'.
  • # is an uncommon character.  Good.  One or two of these should be the comment character, not %.  % should be percent.  It is a fairly common character.
  • I should know what an argument to a macro is by looking at it.  Where is the argument to \sqrt3 ?  You can know this only if you know the definition of \sqrt.  What is the argument to \mymacro{1} ?  Is '{1}' an argument or a block following a macro?  Again, you need to find the definition of \mymacro.


What I Need

A sane-syntax markup language designed for easy typing (i.e., not XML) with an auctex emacs plugin, with a syntax check BEFORE the typeset, with an extended namespace for commands (or macros), and with an ability to use most ctan packages.  Separation of markup from formatting.  HTML or pdf generation.  A user community for this.  Books written for it.


Your mileage may vary, of course.  And to be clear, it is easy to point out weaknesses.  It is hard to put something together.  The folks who have made TeX and LaTeX what they are have to be commended.