r - Is it OK to use transform to add per-row results of an operation on a data.frame?
I'm a bit confused. I routinely use transform like this:
ddply(data.frame, 1, transform, new.column = function(old.col.1,old.col.2,...))
This was recommended by Hadley.
But when I asked a question, Hadley stated this:
"Don't use transform. It's a helper function suitable for interactive use, not for programming with."
So what's wrong with transform? I think I'm convinced that this is stupid:
transform(data.frame,col2=fun(col1)).
But is it not useful in the ddply setting?
There's a difference between using transform within ddply and using the function transform() standalone. It's far better (and quicker) to do:
mydata$col3 <- fun(mydata$col1, mydata$col2)
The function combination ddply/transform is useful if you have more than one column to change, e.g.
mynewdata <- ddply(mydata,1,transform,col3=fun1(col1,col2), col4=fun2(col1,col2))
And even then, you have the more flexible option of using within(), which allows you to use a calculated result in the next calculation:
mynewdata <- within(mydata,{
  col2 <- fun1(col1)
  col3 <- fun2(col1,col2)
})
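As a small runnable illustration of the within() pattern, here is a sketch in which fun1 and fun2 are replaced by two hypothetical helpers (double_it and add_them, which are not from the original post). It also shows, under the assumption that no object called col2 exists in the calling environment, that the same two-step calculation does not work in a single transform() call:

```r
# Hypothetical stand-ins for fun1() and fun2(); not from the original post.
double_it <- function(x) x * 2
add_them  <- function(x, y) x + y

mydata <- data.frame(col1 = c(1, 2, 3))

# Inside within(), col2 already exists by the time col3 is computed.
mynewdata <- within(mydata, {
  col2 <- double_it(col1)
  col3 <- add_them(col1, col2)
})

# transform() evaluates all of its arguments against the original columns,
# so a column created in the same call is not visible to the next argument.
# Assuming there is no other object named col2, this call fails.
transform_result <- try(
  transform(mydata, col2 = double_it(col1), col3 = add_them(col1, col2)),
  silent = TRUE
)
```

The columns are accessed by name here (mynewdata$col2, mynewdata$col3) because within() does not necessarily append new columns in the order they were written.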
The thing is that transform() is written to be used interactively. If you use it within a function, you might run into trouble. It's similar to subset() in that way: they're convenience functions, but they're neither fast nor safe to use within more complex code.
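One way such trouble can show up is name masking. This is a sketch of one possible failure mode, not necessarily the one the quoted advice had in mind. The function names scale_x_transform and scale_x_explicit and the column names are made up for the illustration:

```r
# Illustrative data: note the column named `n`.
df <- data.frame(x = 1:3, n = c(10, 20, 30))

# Hypothetical helper: the author intends `n` to be the function argument.
scale_x_transform <- function(data, n) {
  # transform() looks names up in the data frame first, so `n` silently
  # resolves to the column df$n, not to the argument n.
  transform(data, y = x * n)
}

# Explicit column access leaves no ambiguity about which `n` is meant.
scale_x_explicit <- function(data, n) {
  data$y <- data$x * n
  data
}

res_transform <- scale_x_transform(df, n = 2)
res_explicit  <- scale_x_explicit(df, n = 2)

res_transform$y  # 10 40 90 (uses the column n)
res_explicit$y   # 2 4 6    (uses the argument n)
```

No warning is raised in the first case, which is what makes this kind of bug hard to spot inside larger code.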
Opinions differ on ddply(). In some cases it works quickly and gives clean, readable code; in other cases I consider it serious overkill. ddply() works faster and easier when you have to use non-vectorized functions, in which case the above options wouldn't work. But for that, you also have the option to use mapply:
mynewdata <- within(mydata, col3 <- mapply(myfun,col1,col2))
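To make the non-vectorized case concrete, here is a minimal sketch with a made-up function, mean_with_count (hypothetical, chosen only for illustration). Its if() needs a single TRUE or FALSE, so it cannot be applied to whole columns at once; depending on the R version, doing so gives a warning or an error. mapply() instead calls it once per row:

```r
# Hypothetical non-vectorized function: if() expects a single value of y.
mean_with_count <- function(x, y) {
  if (y == 0) x else mean(c(x, seq_len(y)))
}

mydata <- data.frame(col1 = c(0.5, 2, -1), col2 = c(0, 2, 4))

# mapply() pairs up col1[i] and col2[i] and calls the function row by row,
# returning the results in the same order as the input rows.
mynewdata <- within(mydata, col3 <- mapply(mean_with_count, col1, col2))

mynewdata$col3
# row 1: y is 0, so the result is x itself      -> 0.5
# row 2: mean(c(2, 1, 2))                       -> 5/3
# row 3: mean(c(-1, 1, 2, 3, 4))                -> 1.8
```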
mapply can in this case be quite a bit faster. To give a basic example:
library(plyr)        # provides ddply()
library(rbenchmark)  # provides benchmark()

mydata <- data.frame(col1=rnorm(5),col2=rpois(5,3))

# non-vectorized: if() needs a single value of y
myfun <- function(x,y){
  if(y == 0) mean(x) else mean(c(x,seq(1,y,by=1)))
}

code1 <- expression(newdata <- ddply(mydata,1,transform,col3=myfun(col1,col2)))
code2 <- expression(newdata2 <- within(mydata, col3 <- mapply(myfun,col1,col2)))

> benchmark(code1,code2)
   test replications elapsed relative
1 code1          100    0.50     12.5
2 code2          100    0.04      1.0
The main problem I have with ddply() is that the order of the observations is not guaranteed (here newdata comes back sorted by col1), as you can see in the example output below:
       mydata      |          newdata2           |           newdata
        col1 col2  |       col1 col2      col3   |       col1 col2      col3
1 0.07060223    4  | 0.07060223    4 2.0141204   | 0.05658259    2 1.0188609
2 1.84645791    2  | 1.84645791    2 1.6154860   | 0.07060223    4 2.0141204
3 0.05658259    2  | 0.05658259    2 1.0188609   | 0.84119845    1 0.9205992
4 0.89998084    5  | 0.89998084    5 2.6499968   | 0.89998084    5 2.6499968
5 0.84119845    1  | 0.84119845    1 0.9205992   | 1.84645791    2 1.6154860
Both functions calculate the correct result, but mapply() is faster in this case, while also preserving the order of the observations in the data frame.