
Overhead

The overheads of construction, copy and destruction show both similarities and differences between the DEC Alpha and the Intel Paragon (see the corresponding figure and tables for each machine). On both machines the overheads of constructing and copying matrices increase linearly with the memory required, as expected. On the DEC Alpha, however, a copy takes twice as long as a construction, whereas on the Intel Paragon the two operations differ much less in execution time. This difference may again be caused by the Intel Paragon's pipelining of memory reads, which the DEC Alpha cannot do.
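The kind of measurement described above can be sketched as a small timing harness. This is only an illustration in Python, not the benchmark code used for the reported results: the names time_call, construct, copy and measure are hypothetical, the matrix is a plain list of lists, and the memory figure assumes 8-byte elements.

```python
import time

def time_call(fn, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def construct(n):
    """Build an n x n matrix of zeros (stand-in for matrix construction)."""
    return [[0.0] * n for _ in range(n)]

def copy(m):
    """Duplicate every row (stand-in for a matrix copy)."""
    return [row[:] for row in m]

def measure(sizes):
    """Return (n, nominal bytes, construction time, copy time) per size."""
    rows = []
    for n in sizes:
        m = construct(n)
        t_construct = time_call(lambda: construct(n))
        t_copy = time_call(lambda: copy(m))
        # Assumption: 8 bytes per element, as for double precision values.
        rows.append((n, 8 * n * n, t_construct, t_copy))
    return rows
```

Plotting the two time columns against the nominal byte count is one way to check whether both grow linearly and how the copy-to-construction ratio differs between machines.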

Another difference is seen in the time needed to destroy a matrix. The Intel Paragon does this in constant time, while on the DEC Alpha the time is directly proportional to the amount of memory. A possible cause is that allocation on the DEC Alpha is page based, so that each page must be released individually during deallocation.
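If the page-based explanation holds, the deallocation cost should scale with the number of pages a matrix spans. The arithmetic can be sketched as follows; the function name pages_spanned is hypothetical, and the 8-byte element size and 8192-byte page size are assumptions made for illustration, not values taken from the measurements.

```python
def pages_spanned(n, element_bytes=8, page_bytes=8192):
    """Number of pages needed to hold an n x n matrix.

    Assumptions: densely stored elements of `element_bytes` each and a
    page size of `page_bytes`; both defaults are illustrative only.
    """
    total_bytes = n * n * element_bytes
    # Ceiling division without floating point.
    return -(-total_bytes // page_bytes)
```

Under these assumptions a 1024 x 1024 matrix spans 1024 pages and a 2048 x 2048 matrix spans 4096, so a per-page release cost would grow in direct proportion to the memory used.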

Since SymPLA uses temporary matrices in a number of operations, it begins to lose performance earlier than LAPACK, because the necessary preparations and the duplication of data take time. In addition, the extra memory use causes swapping to start at smaller matrix dimensions than for the LAPACK benchmarks, especially for matrix multiplication and LU factorization.
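The cost of temporaries can be illustrated with a toy matrix class that counts its allocations. This is a sketch in Python and does not reproduce the SymPLA interface; the class Matrix and the methods assign and add_into are hypothetical names chosen for the example.

```python
class Matrix:
    allocations = 0  # counts every storage buffer that is allocated

    def __init__(self, rows, cols, fill=0.0):
        self.rows, self.cols = rows, cols
        self.data = [fill] * (rows * cols)
        Matrix.allocations += 1

    def __add__(self, other):
        # Expression style: the sum is built in a freshly allocated temporary.
        result = Matrix(self.rows, self.cols)
        for i in range(len(self.data)):
            result.data[i] = self.data[i] + other.data[i]
        return result

    def assign(self, other):
        # Duplicate the data of `other` into the existing storage.
        self.data[:] = other.data

    def add_into(self, a, b):
        # Direct style: write a + b into existing storage, no temporary.
        for i in range(len(self.data)):
            self.data[i] = a.data[i] + b.data[i]
```

With three matrices a, b and c already constructed, c.assign(a + b) allocates a fourth buffer and copies its contents, while c.add_into(a, b) allocates nothing. The extra buffer is what makes the expression style reach the memory limit at smaller dimensions.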

Timing the creation of submatrices proved problematic and is not included in the reports: the timings indicated construction in a millisecond or less on the DEC Alpha, whose timer has millisecond resolution, and similarly on the Intel Paragon. The construction overhead for submatrices may therefore be considered negligible compared to the construction overhead for whole matrices, as may the destruction overhead.
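A common way around a coarse timer, not used in the measurements reported here, is to repeat the operation until the total elapsed time spans many timer ticks and then divide by the number of calls. The sketch below assumes a millisecond resolution by default; the function time_per_call and its parameters are hypothetical.

```python
import time

def time_per_call(fn, resolution=1e-3, min_ticks=100, clock=time.perf_counter):
    """Estimate the time of one call to fn with a coarse timer.

    The call is repeated in doubling batches until the elapsed time
    reaches `min_ticks` timer ticks of `resolution` seconds each, so
    that the rounding error of the timer is small relative to the total.
    Returns (seconds per call, number of calls made).
    """
    target = resolution * min_ticks
    calls = 0
    batch = 1
    elapsed = 0.0
    start = clock()
    while elapsed < target:
        for _ in range(batch):
            fn()
        calls += batch
        batch *= 2
        elapsed = clock() - start
    return elapsed / calls, calls
```

The estimate includes loop overhead, so for very cheap operations it gives an upper bound rather than an exact cost.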

Operations on submatrices that require duplication of data into temporary memory have been observed to run at much reduced performance: less than about a tenth of the addition performance shown here, and a somewhat reduced performance ( tex2html_wrap_inline2677 ) in other operations. As this implementation supports using whole matrices and continuous submatrices directly in the operations, operations on such matrices should not suffer these reductions. It is possible that better ways to handle operations involving random submatrices can be found, and such methods should involve as many preselected settings as possible.
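The distinction between the two kinds of submatrix can be sketched as follows. A continuous block can be addressed inside the parent storage through offsets, whereas an arbitrary selection of rows and columns must first be gathered into temporary memory and later scattered back. The code is an illustration in Python under assumed row-major storage; BlockView, gather and scatter are hypothetical names, not the SymPLA interface.

```python
class BlockView:
    """A continuous rows x cols block inside row-major storage `data`."""

    def __init__(self, data, ld, row0, col0, rows, cols):
        self.data, self.ld = data, ld          # ld: row length of the parent
        self.row0, self.col0 = row0, col0
        self.rows, self.cols = rows, cols

    def index(self, i, j):
        # Position of block element (i, j) in the parent storage.
        return (self.row0 + i) * self.ld + (self.col0 + j)

    def add_scalar(self, value):
        # Operates directly on the parent storage, no data is duplicated.
        for i in range(self.rows):
            for j in range(self.cols):
                self.data[self.index(i, j)] += value

def gather(data, ld, row_idx, col_idx):
    """Copy an arbitrary selection of rows and columns into a temporary."""
    return [data[r * ld + c] for r in row_idx for c in col_idx]

def scatter(data, ld, row_idx, col_idx, values):
    """Write a temporary produced by gather back into the parent storage."""
    k = 0
    for r in row_idx:
        for c in col_idx:
            data[r * ld + c] = values[k]
            k += 1
```

The gather and scatter steps each touch every selected element once more than the view does, which is one plausible source of the reduced performance noted above.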

Table: Summary of SymPLA addition on the 233 MHz DEC Alpha

Table: Summary of BLAS/ScaLAPACK addition on the 233 MHz DEC Alpha

Table: Summary of SymPLA addition on the Intel Paragon

Table: Summary of BLAS/ScaLAPACK addition on the Intel Paragon

