r - Is it OK to use transform to add per-row results of an operation on a data.frame?

- February 15, 2010

I'm a bit confused. I routinely use transform like this:

    ddply(data.frame, 1, transform, new.column = function(old.col.1,old.col.2,...))

This was recommended by Hadley.

But I asked a question, and Hadley stated this:

"Don't use transform. It's a helper function suitable for interactive use, not for programming with."

So what's wrong with transform? I think I'm convinced that this is stupid:

    transform(data.frame, col2 = fun(col1))

But is it not useful in the ddply setting?

There's a difference between using transform within ddply and using the function transform() standalone. For the standalone case, it is far better (and quicker) to do:

    mydata$col3 <- fun(mydata$col1, mydata$col2)
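As a minimal sketch of that point, the direct assignment and the standalone transform() call produce the same columns. The data values and the vectorized function `fun` below are made up for illustration:

```r
# Hypothetical data and a hypothetical vectorized function
mydata <- data.frame(col1 = c(1, 2, 3), col2 = c(10, 20, 30))
fun <- function(a, b) a + b

# Direct assignment: plain indexing, no non-standard evaluation
via_assign <- mydata
via_assign$col3 <- fun(via_assign$col1, via_assign$col2)

# Standalone transform(): same columns, but evaluated inside the data frame
via_transform <- transform(mydata, col3 = fun(col1, col2))

all.equal(via_assign, via_transform)
```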

The combination ddply/transform is useful if you have more than one column to change, e.g.:

    mynewdata <- ddply(mydata, 1, transform, col3 = fun1(col1, col2), col4 = fun2(col1, col2))

And then you have the more flexible option of using within(), which lets you use a newly calculated column in the next calculation:

    mynewdata <- within(mydata, {
        col2 <- fun1(col1)
        col3 <- fun2(col1, col2)
    })
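To make the difference concrete, here is a small sketch with made-up data and column names (`df`, `a`, `doubled` and `plus_one` are illustrative only). transform() evaluates all of its arguments against the existing columns, so a later argument cannot see a column created by an earlier one, whereas within() runs its statements in order:

```r
# Illustrative data; all names here are hypothetical
df <- data.frame(a = 1:3)

# within(): statements run in sequence, so 'plus_one' can use 'doubled'
res_within <- within(df, {
  doubled  <- a * 2
  plus_one <- doubled + 1
})

# transform(): 'doubled' does not exist yet when 'plus_one' is evaluated,
# so this fails (assuming no object named 'doubled' exists in the
# calling environment); the error message is captured as a string
res_transform <- tryCatch(
  transform(df, doubled = a * 2, plus_one = doubled + 1),
  error = function(e) conditionMessage(e)
)

res_within$plus_one
res_transform
```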

The thing is that transform() is written to be used interactively. If you use it within a function, you might run into trouble. It is similar to subset() in that way: they're convenience functions, but they're neither fast nor safe to use within more complex code.
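One possible illustration of such trouble, with made-up names (`df`, `rescale_transform`, `rescale_explicit`): transform() looks names up in the data frame before it looks in the calling function, so a column that happens to share its name with a function argument silently wins. This is only a sketch of one failure mode, not a complete list:

```r
# Hypothetical data: note the column that happens to be called 'scale'
df <- data.frame(x = 1:3, scale = c(10, 10, 10))

# Uses transform(): 'scale' is found in the data frame first,
# so the function argument of the same name is silently ignored
rescale_transform <- function(data, scale) {
  transform(data, y = x * scale)
}

# Uses explicit indexing: 'scale' can only be the function argument
rescale_explicit <- function(data, scale) {
  data$y <- data$x * scale
  data
}

y_transform <- rescale_transform(df, 2)$y  # multiplies by the column (10)
y_explicit  <- rescale_explicit(df, 2)$y   # multiplies by the argument (2)
```

The two functions look equivalent but return different results for the same call, and no warning is raised.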

Opinions differ on ddply(). In some cases it works quickly and gives clean, readable code; in other cases I consider it serious overkill. ddply() works faster and easier when you have to use non-vectorized functions, in which case the options above wouldn't work. For that, you also have the option to use mapply:

    mynewdata <- within(mydata, col3 <- mapply(myfun, col1, col2))

mapply can in that case be quite a bit faster. To give a basic example:

    library(plyr)        # ddply()
    library(rbenchmark)  # benchmark()

    mydata <- data.frame(col1 = rnorm(5), col2 = rpois(5, 3))

    # Non-vectorized: expects a single x and a single y
    myfun <- function(x, y) {
        if (y == 0) mean(x) else mean(c(x, seq(1, y, by = 1)))
    }

    code1 <- expression(newdata <- ddply(mydata, 1, transform, col3 = myfun(col1, col2)))
    code2 <- expression(newdata2 <- within(mydata, col3 <- mapply(myfun, col1, col2)))

    > benchmark(code1, code2)
       test replications elapsed relative
    1 code1          100    0.50     12.5
    2 code2          100    0.04      1.0

The main problem I have with ddply() is that the order of observations is not guaranteed to be preserved, as you can see in the example output below:

    mydata              newdata2                    newdata
            col1 col2         col1 col2      col3         col1 col2      col3
    1 0.07060223    4 | 0.07060223    4 2.0141204 | 0.05658259    2 1.0188609
    2 1.84645791    2 | 1.84645791    2 1.6154860 | 0.07060223    4 2.0141204
    3 0.05658259    2 | 0.05658259    2 1.0188609 | 0.84119845    1 0.9205992
    4 0.89998084    5 | 0.89998084    5 2.6499968 | 0.89998084    5 2.6499968
    5 0.84119845    1 | 0.84119845    1 0.9205992 | 1.84645791    2 1.6154860

Both functions calculate the correct result, but mapply() is faster in this case and preserves the order of observations in the data frame.
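A deterministic version of the mapply approach can be sketched as follows. `myfun` is the function from the example above; the data values are made up so the result can be checked by hand. mapply() calls `myfun(col1[i], col2[i])` once per row, in row order, which is why the rows stay where they are:

```r
# myfun as in the example above; the data values are hypothetical
myfun <- function(x, y) {
  if (y == 0) mean(x) else mean(c(x, seq(1, y, by = 1)))
}

mydata <- data.frame(col1 = c(0.5, 2.0, -1.0), col2 = c(4, 0, 1))

# One call of myfun per row, results returned in row order
newdata2 <- within(mydata, col3 <- mapply(myfun, col1, col2))

# The same computation as an explicit loop, for comparison
col3_loop <- numeric(nrow(mydata))
for (i in seq_len(nrow(mydata))) {
  col3_loop[i] <- myfun(mydata$col1[i], mydata$col2[i])
}

newdata2
all.equal(newdata2$col3, col3_loop)
```

For the first row, for instance, the value is the mean of 0.5, 1, 2, 3 and 4, which is 2.1.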
