Saturday, July 13, 2013

Evaluating new tools/systems for project development.

Tag: SVN vs Perforce

I have to choose between SVN, Git and Perforce when creating a new project in Assembla. I knew SVN well from previous work. I have been using Perforce extensively for the past 9 months. I have never typed a single Git command.

The final decision is Perforce. I am a bit concerned about the network requirement of Perforce due to its centralised workflow. I know how to make local changes without the network and reconcile the workspace later when the network is available. Still, this is not the norm in the Perforce world.
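For the record, the offline workflow I mean looks roughly like this (a sketch, not a recipe; `p4 reconcile` exists in newer Perforce releases, and the depot path below is a placeholder):

```
# Edit, add or delete files in the workspace while disconnected,
# without issuing any p4 commands. When the network is back:
p4 reconcile //depot/project/...   # open changed files for add/edit/delete
p4 submit -d "changes made while offline"
```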

Git is the first one out. The documentation and terminology just don't fit my brain. And I don't need all these distributed features. I foresee only a handful of developers collaborating on the project. I don't think Git is going to work for my future projects either. Just my personal opinion.

SVN is good enough. As a version control backend, I don't think SVN is missing anything I need. The issue is on the front-end side. I am now so used to the pretty UI in Perforce for tracking changes. I perform all the actual version control work on the command line, but when it comes to 'Who did what, where, and when?' and 'Where did this code originate and how did it get here?', the Perforce UI is a clear win.

There are front ends for SVN providing similar features, but the good ones are all paid software. So why shouldn't I use Perforce as a single, well-integrated package?

During the decision-making process, I came across some suggestions that should be valid for most tool/system selection situations. I will keep these in mind in the future.


  1. performance on typical usage, especially for remote sites
  2. resource requirements
  3. importance of the changes needed in the working habit
  4. support availability and cost

Friday, July 12, 2013

A serial bus for core configuration

In many of my previous projects using an FPGA as an accelerator, I came across this situation.


  1. I need to configure the cores (usually parallel cores with the same or similar architecture) by writing configuration/parameter registers in each core.
  2. The same type of registers is used to sample the results or status of the cores.
  3. The master controlling the cores (by reading/writing the registers) is usually a soft CPU or an interface to the host PC.


Given that all cores sit in the same FPGA, and that the FPGA provides a balanced clock tree, no de-skew or resynchronisation is needed between the master and the cores. This makes things much easier. NOT!

In FPGA design, these registers are usually implemented in distributed memory (i.e. DFF primitives). This is sometimes necessary since the contents of these registers are required simultaneously (e.g. the parameters of a FIR filter). It is then straightforward to allow them to be written simultaneously as well.

In an example design, we have 10 cores, each with 10 configuration/parameter/result registers, all 32-bit. That is 32*10*10 = 3200 signals (ignoring the read/write controls), just for infrequent data communication.

Here comes the problem: the connections between the master and the cores make it difficult to meet the timing constraints, or sometimes even impossible to pass P&R. We simply spend too much routing resource on something unrelated to design performance.

Now you are thinking of using a bus system to connect these registers to the master, with an address decoder in each core. This relaxes the routing channel congestion, since the number of signals now depends on the number of cores rather than the number of registers.

But the timing issue is still not solved. Putting a global decoder in the master and connecting all registers to a single parallel bus will not work. First, some FPGAs have no tristate buffers for internal connections, so you end up using even more signals (to send the read data back to the master). Also, the fan-out of the master outputs will be too high, since they drive all the cores. More importantly, we usually fill the FPGA with as many parallel cores as possible, so the cores end up scattered all over the chip. The connection to the farthest core then dominates the critical path.

You are upset by the fact that you have a highly optimised computation design being slowed down by a configuration bus. There are two options from here: pipeline the bus between the master and the cores, or use a slower clock for the configuration part.

The second approach solves the issue once and for all. But clocking resources are limited in an FPGA and are more valuable than DFFs and LUTs. It is also not easy to implement the synchronisers in an FPGA. Using asynchronous BlockRAM for a handful of registers is another big waste. In the end, you also need to set special constraints for the STA tool to get timing closure correct. If you are willing to go through all this trouble, why not just declare the bus a multi-cycle path in the timing constraints? Either way, you cannot avoid asynchronous design in the FPGA down this path.

The first approach is actually easier, with the help of automatic retiming in the EDA tools. All it costs is some DFFs, which should not be an issue in modern FPGAs. The real issue is the latency, which impacts the bus protocol: there is now a delay between the write signal and the actual update of the registers in the cores. For example, the master must wait until the parameters are actually updated before sending the 'go' signal (usually a single wire) to the cores. It gets more complicated if any kind of acknowledgement is required for the write operation (e.g. the full signal of a FIFO interface).

What I am proposing here is a serial bus which has the following advantages:


  1. It runs under the same clock shared by the master and the cores.
  2. It needs a minimum number of signals (only two) between the master and each core.
  3. It requires minimal control logic to implement.

It has similar disadvantages to the first approach above, but it can still save a lot of routing resources. It is also suitable for ASIC designs, where flip-flops are more valuable than in FPGAs and the cost of retiming every bus is too high.

The details of this serial bus will be presented in the next post.

Wednesday, May 01, 2013

Cost Effectiveness of Acceleration

I have done many acceleration and optimisation jobs on various platforms, including FPGA RTL-level design, CUDA programming and parallel/distributed CPU cluster design.

It usually costs me weeks, if not months, to optimise a design at the implementation level. I also need to know the hardware platform very well; this learning time can be considered a one-off cost per platform. Finally, the optimised implementation needs constant upgrades to fully utilise the ever-advancing computing platforms (CPUs, GPUs, compilers?!).

The question I keep asking myself is: 

    Is it wise to spend so much time and effort on optimisation or acceleration?

My answer is: It is wise if your application fits in (at least) one of the following categories:

  • Time is money.

Literally, time can be translated directly into currency. In some businesses, a few seconds of advantage can easily pay off the salary of the whole engineering team. I have seen real-life examples, and some of my friends work exactly in these areas (high-frequency trading, oil/gas exploration, etc.). As far as I know, both the bankers and the engineers are happy with the outcome. And the business is somewhat forced in this direction: if somebody else is making money faster in the same market, it usually means you are earning less at the same time.

But the value of time also easily renders the results worthless in a short time. When it is critical to produce a usable result before the raw data becomes outdated, latency becomes the bottleneck. Conventional high-performance computing (HPC) methodologies, which emphasise throughput, are not applicable here. And physical limitations (e.g. the speed of light) will eventually stall the race to the lowest latency.

My opinion: It is a fast growing area but it may grow to its end sooner than we thought.

  • Repetitively running jobs.

Considering all the overheads in a real-world application, including disk I/O, memory copies, process synchronisation and data preparation, we seldom see over 10x speedup in overall execution time. It is simple math: if the portion (in terms of execution time) of the application that can be accelerated is less than 90%, the maximum achievable speedup is already less than 10x. And if the resulting application runs once per month, most people won't care whether it takes 2 hours or 20.
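That simple math is Amdahl's law: if a fraction p of the execution time can be accelerated by a factor s, the overall speedup is

S = 1 / ((1 - p) + p/s)  <  1 / (1 - p)

so with p = 0.9, even an infinitely fast accelerator (s approaching infinity) cannot exceed 10x overall.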

But if the resulting application is run every hour by every member of a reasonably sized team, even a humble 1.5x improvement will save a significant number of man-months for the business. Also, the target end users in this category are easily satisfied by that 1.5x speedup, for a very long time. (I would be crazily happy if Xilinx sped up the place-and-route process by 1.5x.) This category also includes the software industry, where tens of thousands of copies of a single product are sold.

My opinion: It is worth doing but you don't usually see big excitement in it.

  • Framework development.

This is for developers or consultants who plan to make a living from application acceleration. It is worth planning well for each platform and creating a framework that you can reuse in later projects. Again, it is a larger up-front payment for repetitive (development) jobs.

Apart from these, I don't see why one should spend weeks on acceleration. Paying for a good compiler, using an optimised library, and playing with the compilation options will easily give a big improvement.

Sunday, February 17, 2013

Resize image files in batch (in Mac)

Tags:

JPEG resize, image resize, Mac OS X, Mountain Lion, Preview


Problem:

I have a few photos (taken with my digital camera) and want to upload them to Picasa. But the images are too large (in both pixel dimensions and file size). I want to scale them down before uploading.


Solution:

1. Select all files in a Finder window (Command-click or Shift-click for multiple files).
2. Right-click on one of the selected files and select "Open With" -> "Preview".
3. In the Preview window, multiple files are shown as multiple pages on the side bar on the left.
4. Select all opened files by "Command-A" on the side bar.
5. Select from the menu bar: "Tools" -> "Adjust Size ...".
6. There you can change the size. I selected "Fit into: 1280x1024 pixels"; yours may differ.
7. Click "OK" to close the Adjust Size window.
8. Finally, "Save" the files and quit Preview.
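The same batch resize can also be scripted in Terminal with the sips tool bundled with Mac OS X (a sketch; note that sips rewrites files in place, so work on copies):

```
# Copy the photos aside, then resize the copies.
mkdir resized && cp *.jpg resized/ && cd resized
sips -Z 1280 *.jpg   # -Z fits the longest side into 1280 px, keeping the aspect ratio
```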


Result:

The file size dropped from 5MB to 256KB, while the photos still display perfectly in the web browser.

Thursday, January 31, 2013

Get WAN IP in one line command

If the local host is behind a firewall and/or NAT, ifconfig may not report the IP address that is seen externally.

The following (one line) command prints out the WAN side IP.

# curl -s http://checkip.dyndns.org | sed -e 's/.*: //;s/<.*//'
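The sed part just strips the HTML around the address. You can check it against a canned response (the markup below mimics what checkip.dyndns.org returns; the IP is a hypothetical documentation address):

```shell
# A response body like the one the service returns (hypothetical IP):
sample='<html><head><title>Current IP Check</title></head><body>Current IP Address: 203.0.113.7</body></html>'
# 's/.*: //' deletes everything up to and including ": ";
# 's/<.*//' deletes from the first remaining '<' to the end.
echo "$sample" | sed -e 's/.*: //;s/<.*//'   # prints 203.0.113.7
```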

Sunday, January 13, 2013

Expect: interact with external program with script

The following script calls an external program and interacts with it by controlling its inputs and outputs.

The script first spawns the external program 'dc', a postfix (reverse-polish) calculator.
It then sends the string "4p\r" as input to 'dc'.
In the expect block, the script tries to match two patterns: "0\r", and any non-zero digit followed by a carriage return.
If the output of 'dc' is 0, the script terminates it by sending the command "q".
If the output of 'dc' is a digit greater than 0 (starting from 4 in this example), the value is decreased by sending "1-p" to 'dc'.


#!/usr/bin/expect -f

spawn dc      
send "4p\r"   

expect {

  "0\r" {
    puts "Expect: matched $expect_out(0,string)";
    send "q\r"
    puts "Expect: exit";
  }

  -re {[1-9]\r} {
    puts "Expect: matched $expect_out(0,string)";
    send "1-p\r";
    puts "Expect: reduce by 1";
    exp_continue
  }

}

Running the script generates the following output on screen.


$ ./script.exp
spawn dc
4p
4
Expect: matched 4
Expect: reduce by 1
1-p
3
Expect: matched 3
Expect: reduce by 1
1-p
2
Expect: matched 2
Expect: reduce by 1
1-p
1
Expect: matched 1
Expect: reduce by 1
1-p
0
Expect: matched 0
Expect: exit


Notes:

1. Expect captures all output from the external program, including '\r'.
2. Variable $expect_out(0,string) stores the matched string from the immediate previous matching.
3. Variable $expect_out(buffer) stores all output received up to and including the matched string; Expect removes these characters from its internal buffer after each match.

Tuesday, January 08, 2013

Interface between C and TCL : Case III

Extending a TCL script with C functions. The C functions are compiled into a dynamically loaded library and called from within a TCL script. The script can be interpreted by a standard TCL shell.

The C library is listed below.

#include <stdio.h>
#include <tcl.h>

int my_cmd1 (ClientData cdata, Tcl_Interp *interp,
    int objc, Tcl_Obj *const objv[]) {
  printf("Hello, world!\n");
  return TCL_OK;
}

int my_cmd2 (ClientData cdata, Tcl_Interp *interp,
    int objc, Tcl_Obj *const objv[]) {
  printf("Hello, again!\n");
  return TCL_OK;
}

// The init function name must match the dynamic library name,
// with the first letter capitalised and an "_Init" suffix.
int My_cmd_Init (Tcl_Interp *interp) {

  if (Tcl_InitStubs(interp, TCL_VERSION, 0) == NULL)
    return TCL_ERROR;

  Tcl_CreateObjCommand(interp, "my_cmd1", my_cmd1,
      (ClientData)NULL, (Tcl_CmdDeleteProc *)NULL );

  Tcl_CreateObjCommand(interp, "my_cmd2", my_cmd2,
      (ClientData)NULL, (Tcl_CmdDeleteProc *)NULL );

  return TCL_OK;
}

To compile this file into a dynamic library (in Mac OS X 10.8.2), run the following line:

gcc -Wall -shared -o my_cmd.dylib my_cmd.c \
  -undefined dynamic_lookup -rdynamic

The expected output is:

$ tclsh
% load ./my_cmd.dylib
dlsym(0x7fb5dbe02360, My_cmd_SafeInit): symbol not founddlsym(0x7fb5dbe02360, My_cmd_Unload): symbol not founddlsym(0x7fb5dbe02360, My_cmd_SafeUnload): symbol not found
% my_cmd1
Hello, world!
% my_cmd2
Hello, again!
% exit

The "symbol not found" messages can be safely ignored for this simple example.

Interface between C and TCL : Case II

The C main function creates an interactive TCL shell. The program terminates after the TCL shell exits and will not return control to C.

The C main function is listed below.

#include <stdio.h>
#include <tcl.h>

// A dummy initialisation.
int Tcl_AppInit(Tcl_Interp *interp) { return 0; }

int main(int argc, char *argv[]) {

  printf("In C start\n");
  Tcl_Main(argc, argv, Tcl_AppInit);
  printf("In C end\n");   // This line will never be executed.

  return 0;
}

The program is compiled by:

gcc -Wall -o main main.c -ltcl

The expected output is:

$ ./main
In C start
% puts "hello"
hello
% exit

Interface between C and TCL : Case I

The C main program calls the TCL interpreter to run an external TCL script. C and TCL communicate through variable values. TCL can call C functions.

The C main program is listed below.

#include <stdio.h>
#include <tcl.h>

// a function to be called from within the TCL script
int my_func (ClientData data, Tcl_Interp *interp,
    int objc, Tcl_Obj *const objv[]);

int main(void) {

  Tcl_Interp *interp = NULL;

  interp = Tcl_CreateInterp();
  if (interp)
    printf("In C: TCL interpretor started.\n");

  Tcl_CreateObjCommand(interp, "my_cmd", my_func, 
      (ClientData)NULL, (Tcl_CmdDeleteProc *)NULL );

  printf("In C: TCL script begin.\n");
  if (Tcl_EvalFile(interp, "simple.tcl") == TCL_OK)
    printf("In C: TCL script end.\n");

  printf("C says: Hello, %s!\n", Tcl_GetVar(interp, "name", 0));

  Tcl_DeleteInterp(interp);
  if (Tcl_InterpDeleted(interp))
    printf("In C: TCL interpretor stopped.\n");

  return 0;

}


int my_func (ClientData data, Tcl_Interp *interp,
    int objc, Tcl_Obj *const objv[]) {
  int i;

  printf("In C: my_func() started.\n");
  printf("  obj[0] = %s.\n", Tcl_GetString(objv[0]));
  printf("  obj[1] = %s.\n", Tcl_GetString(objv[1]));
  if (Tcl_GetIntFromObj(interp, objv[2], &i) == TCL_OK)
    printf("  obj[2] = %d.\n", i);
  else
    printf("  obj[2] is not an integer!\n");
  printf("  obj[3] = %s.\n", Tcl_GetString(objv[3]));
  printf("In C: my_func() ended.\n");

  return TCL_OK;
}

The TCL script is listed below.

#!/usr/bin/tclsh

puts "Tcl calls: my_cmd (implemented as a C function)"
my_cmd abc +123 xyz

puts "----"

puts "Tcl asks: What is your name? "
gets stdin name

The program is compiled by (in Mac OS X 10.8.2):

gcc -Wall -o main main.c -ltcl

The expected output is (in Mac OS X 10.8.2):

$ ./main 
In C: TCL interpretor started.
In C: TCL script begin.
Tcl calls: my_cmd (implemented as a C function)
In C: my_func() started.
  obj[0] = my_cmd.
  obj[1] = abc.
  obj[2] = 123.
  obj[3] = xyz.
In C: my_func() ended.
----
Tcl asks: What is your name? 
Brittle
In C: TCL script end.
C says: Hello, Brittle!
In C: TCL interpretor stopped.