This post will eventually make its way into my blog. I will add a reference back to this thread to give credit to those joining the conversation.

Informally: given a method, can we break it up into smaller parts that are more comprehensible to the programmer? I want to define this problem formally so that I can come up with a solid algorithmic solution that can be worked out by hand or possibly even automated. Either way, it has to be completely mechanical.
Since we don't have a model of human comprehension, and can't afford evolutionary techniques against people (which, as I write this, makes ever more sense and could actually be possible), my best bet is to define some constraints that will guarantee a simpler method is output from a more complex one.

The most naive thing the refactorer can do is return the method as is. The other naive thing it can do is encapsulate each statement into a new method. Both of these are utterly useless. Nonetheless, we have defined the 'range' of the algorithm, and have actually already bounded the solution (even without having totally defined the problem yet).

The algorithm's stability and sanity pivot on another point: if a method has already been refactored into atomic methods, applying the algorithm to those methods must be an identity operation (it will leave them unchanged).

To simplify the whole thing and give myself a starting point, we will assume that all the methods we define are actually context-less functions. The form of these functions is: function (input1, input2, input3, ...) => (output1, output2, output3, ...).
All of this should be taken in a functional context (there are no state variables, only symbols that are bound to expressions or values).

Using pseudocode, here goes:
naive_example: (x1, x2) => (y1, y2)
   y1 <= x2;
   y2 <= x1 * x2;
The output from the algorithm must come out like the following:
naive_example: (x1, x2) => (y1, y2)
   y1 <= atomic1(x2);
   y2 <= atomic2(x1, x2);

atomic1: (x) => (y)
   y <= x;

atomic2: (x1, x2) => (y)
   y <= x1 * x2;
This is the most naive example I could think of at the moment, and I have already spotted a lot of problems with the rigidity of the function definition I am using. Additionally, there is a problem with singly bound variables that I want to avoid. I need variables to be assignable more than once, which can be solved by using multiple instances of a variable (x becomes x1, x2, x3, ...), as sketched below.
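Here is a minimal sketch of that renaming idea in the same pseudocode (the rebind function and the instance numbering are my own hypothetical illustration). Before, with x reassigned:
rebind: (x) => (y)
   x <= x + 1;
   x <= x * 2;
   y <= x;
And after, with one instance per assignment, every variable is singly bound again:
rebind: (x1) => (y)
   x2 <= x1 + 1;
   x3 <= x2 * 2;
   y <= x3;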

Share your thoughts; I will think more about all of this and expand on it later.
What you're looking for is a solution to the impossible. Trying to automate refactoring without human intervention is something that a lot of companies have spent lots of money trying to achieve. The problem occurs specifically because the mechanism cannot accommodate more than one expression at a time. In a "pure" functional language this would be trivial, given that you define your atomic functions the same way you'd define a mathematical function, so every expression is a function in itself. What's worse is that you specifically expect your data to be mutable, which subsequently forces you to do what you said at the end. Refactoring should really not be that implicit.
What would be nice to have is a system similar to doxygen but for refactoring.
When a programmer is coding he/she specifies each connected component of code inside a certain function. Then the automated refactoring will take that code and re-write it.
Does this make any sense or is it appealing to anyone else?
Why would I want to re-factor my code? I still don't get it...
Kassem wrote: Why would I want to re-factor my code? I still don't get it...
Because quality comes (at least) as much from your code as it does from your binaries. I have been dealing with some pretty obscure code for the past two days. No matter how well it performs, it is still a massive pain to maintain.
Refactor code to separate concerns, to find hot spots where parallelism can be exploited, and to make code more reusable and understandable.
Refactor a method into its most rigidly connected components.

This thread deals with refactoring at the function level; there are research papers and theses on the internet about class hierarchy refactoring, as I saw when I quickly skimmed through the first few Google results.
nablaux wrote: What would be nice to have is a system similar to doxygen but for refactoring.
When a programmer is coding he/she specifies each connected component of code inside a certain function. Then the automated refactoring will take that code and re-write it.
Does this make any sense or is it appealing to anyone else?
Example please.
This might be interesting, though it incurs a change in the problem domain (human hints).
nablaux wrote: What would be nice to have is a system similar to doxygen but for refactoring.
When a programmer is coding he/she specifies each connected component of code inside a certain function. Then the automated refactoring will take that code and re-write it.
Does this make any sense or is it appealing to anyone else?
This would thus mean that there's human intervention. As I've said previously, it's impossible to fully automate this without human intervention.
Some new revelations on the rigidity of the problem:
Given a list of expressions (mainly function calls), what are the permutations that would leave the output intact?

Example:
(x, y, z) => (u, v, w):
   u = func1(x, y);
   v = func2(u, z);
   w = func3(u, v);
In this example, there is no permutation of the above that would not change the meaning of the program (assuming that we are not in control of func1, func2, func3, or that we don't want to inspect them, or that we may want to change them in the future).

This example, however:
(x, y, z) => (u, v, w):
   u = x;
   v = y;
   w = z;
Any permutation of the list will lead to the same meaning when the function returns.

The crux lies in the dependencies.
In the first example:
'u' depends on 'x' and 'y'.
'v' depends on 'u' and 'z'.
'w' depends on 'u' and 'v'.

A topological sort of these dependencies would quickly show that this is the order of operations that must happen, and the only one that has this particular meaning.
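To make that concrete, here is a minimal sketch in Python (the encoding is my own assumption, not part of the thread: I map each symbol to the set of symbols it depends on) using the standard library's topological sorter on the dependencies above:

from graphlib import TopologicalSorter  # Python 3.9+

# Each symbol maps to the symbols it depends on; plain inputs have no deps.
deps = {
    "u": {"x", "y"},   # u = func1(x, y)
    "v": {"u", "z"},   # v = func2(u, z)
    "w": {"u", "v"},   # w = func3(u, v)
}

# static_order() yields one valid evaluation order; among u, v, w it is forced.
print(list(TopologicalSorter(deps).static_order()))
# e.g. ['x', 'y', 'z', 'u', 'v', 'w']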

What if we changed the last statement into simply:
w = func3(z);
The dependency is lost, and this statement becomes dependent only on the last input. A simple flood-fill-like operation on the dependency graph of the statements (regardless of their order) would show the different operations in this function that can go in parallel and are indeed separable, as the sketch below shows.
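Here is a sketch of that flood fill in Python, under the same assumed (output, inputs) encoding of statements. Statements are linked only by def-use edges (one reads what another writes), so a statement that reads nothing but the function's inputs lands in its own group:

def separable_groups(stmts):
    # stmts: list of (output_var, set_of_input_vars). The function's own
    # inputs are written by no statement, so reading them adds no edge.
    writer = {out: i for i, (out, _) in enumerate(stmts)}
    adj = [set() for _ in stmts]
    for i, (_, ins) in enumerate(stmts):
        for v in ins:
            if v in writer:                # def-use dependency edge
                adj[i].add(writer[v])
                adj[writer[v]].add(i)
    seen, groups = set(), []
    for start in range(len(stmts)):
        if start in seen:
            continue
        stack, comp = [start], []          # flood fill one component
        seen.add(start)
        while stack:
            j = stack.pop()
            comp.append(j)
            for k in adj[j] - seen:
                seen.add(k)
                stack.append(k)
        groups.append(sorted(comp))
    return groups

stmts = [("u", {"x", "y"}),   # u = func1(x, y)
         ("v", {"u", "z"}),   # v = func2(u, z)
         ("w", {"z"})]        # w = func3(z)
print(separable_groups(stmts))  # [[0, 1], [2]] -- w is separable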
@xterm: I refuse to believe this is mathematically impossible. It may be impossible to apply to some context-full languages, but it is perhaps extremely applicable to functional languages, or to a subset of some language (const functions with const inputs in C++, for example).
Just another example of pure laziness. period. :)

I too believe that this is somehow impossible to achieve. The results will not be as good as doing it the human way. The only thing this can help with is time, but I'd prefer quality over that.
mrmat wrote: Just another example of pure laziness. period. :)

I too believe that this is somehow impossible to achieve. The results will not be as good as doing it the human way. The only thing this can help with is time, but I'd prefer quality over that.
This tool is specifically for code analysis. I am even sure that it has been implemented somewhere (be it in the compiler, or even in the CPU itself).

Breaking up a method into its separable components is invaluable. It's as much laziness as the laziness involved in not wanting to write ASM code.
mrmat wrote: I too believe that this is somehow impossible to achieve. The results will not be as good as doing it the human way. The only thing this can help with is time, but I'd prefer quality over that.
If you have a mechanical solution to a mathematical problem, there is no 'human way' any more. At most it would be repeated manual application of an algorithm. Hardly smart.

"Somehow impossible to achieve"
What the fuck? Where are you getting this from?
I guess the problem is that each developer writes differently. How can you develop something so generic that it covers every coding style? Heck, I am not rigorous enough to write two different methods in the same way.

And if you want to write your original code in a fashion that would be 'refactorable', you go back to what xterm was saying: you have human intervention in the refactoring.

But this is a very interesting topic; I'd be interested in finding out the results of your experiments, so keep us updated.
@rahmu: Writing style is hardly an issue here. It's just the fact that I am thinking about imperative functions without access to global scope or class scope.

If you passed in a single method, what would be a possible automatic refactoring technique? You can simply take apart the code paths that are independent of each other, and they'll automatically become readable and parallel-ready.
It would not take much to alter the mathematical technique into a context-full imperative technique that takes into account the existence of references rather than just read-only inputs and write-only outputs; see the sketch below.
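A minimal sketch of what I mean, back in the earlier pseudocode (swap is a hypothetical example): a by-reference parameter splits into one read-only input and one write-only output, which lands us back in the context-less form.
swap: (ref a, ref b)
becomes:
swap: (a_in, b_in) => (a_out, b_out)
   a_out <= b_in;
   b_out <= a_in;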
I don't know if it's related, but I was watching ... someone... today who is totally clueless about the bash shell, trying to make her way around.
As she tried to change directories and go to /home/rolf, she did
cd home rolf
So I thought, why bother with paths? They look intuitive to us geeks but not to common people. cd home rolf = cd home then cd rolf is more intuitive and takes just as much space.
Also, it seems to me that in natural language, stuff is centered around the object, yet in computing stuff is centered around the action... function... command, whatever you want to call it.
People tend to talk to the computer in an object-centered way. For example, as she was trying to find and open a text document called sh*t, she wrote:
sh*t
then tried
home sh*t
rolf sh*t
The right command would be:
less /home/rolf/sh*t
Yet as you can see, priority is given to the command: it comes first, and it interprets the arguments. I would like to try the opposite.
For example if the user writes:
sh*t
Then the computer starts by looking for an object called sh*t, first in the proximity (that is, in the current directory and context); then, once the closest match is found, it sees what it can do with it.
Maybe what differentiates a geek from the average user is the ability of the geek to switch "contexts", the same way a parser switches states.
Take for example:
ls -l ./rolf |grep txt\$
How many times do we have to switch contexts when we read this? First ls, then we are in ls's argument context, then ./ means local-directory context, then |grep puts us in grep's argument context, keeping in mind that it will receive the output of ls....
What if we could replace this "context switching" with more repetition, as in the first cd example given?
In this case, I guess it would be something like... well, I don't really know; can someone help? Mostly, how can you express the txt$ regex with more basic operations, that is, without switching to the "regex context"?
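The closest I can get is a sketch only (a glob is not a general regex replacement), but for this particular query the shell's own globbing avoids the regex context entirely:
ls -l ./rolf/*txt
Here *txt stays inside the shell's single glob context and matches names ending in txt, much like the txt$ anchor did.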
I hope I'm not hijacking the thread; I have the feeling you can get something useful out of my braindump.
I still need to figure out two huge aspects before moving on to full imperative refactoring techniques. You should know that what I have described so far is mathematically quite possible to factor, automatically and algorithmically efficiently.
The issues are: branching/jumps (while, if, ...) and recursion (I need to consider whether it causes any problems).

Perhaps I will write a parser for a simple purely-computational language (integer/double as the only data types) and implement the auto-factoring on top of it. Arrays should prove problematic to work with as well.

I guess I am just wishing to do my Master's degree in Computer Science with all of this theoretical work.