Analysis of bzr update over HTTP for bzr 1.6b3
Measurement
Doing bzr update when there is nothing to do takes longer than I would expect, so I ran it with -Dhttp. Here is a summary of the transaction (columns: step, cumulative time, time since the previous step, event):
1   0.341  0.341  Startup, about to connect()
2   0.474  0.133  POST /repo/branch/.bzr/smart
3   0.884  0.410  404 Not Found, Keep-Alive = True
4   0.885  0.001  About to connect()
5   1.101  0.216  GET /repo/branch/.bzr/branch-format
6   1.295  0.194  200 OK, GET /repo/branch/.bzr/branch/format
7   1.498  0.203  200 OK, GET /repo/branch/.bzr/repository/format
8   1.703  0.205  404 Not Found, POST /repo/.bzr/smart
9   2.113  0.410  404 Not Found, GET /repo/.bzr/branch-format
10  2.329  0.216  200 OK, GET /repo/.bzr/repository/format
11  2.522  0.193  200 OK, GET /repo/.bzr/repository/pack-names
12  2.728  0.206  200 OK, HEAD /repo/.bzr/repository/shared-storage
13  2.933  0.205  200 OK, GET /repo/.bzr/repository/pack-names
14  3.137  0.204  200 OK, GET /repo/branch/.bzr/branch/last-revision
15  3.341  0.204  200 OK, <stop>
Analysis
- For some reason we connect() twice. We don't keep the transport alive after the first POST fails (1-4), which is a bit strange considering that we do keep it alive when a later POST fails (8-9). Other than that one time, we do manage to keep the connection alive (as seen by the max=XXX value decreasing).
- We spend a lot of time finding the repository and opening it. If we only opened the repository when we needed it, we would save 7 round trips. Out of approximately 15 steps, that is about half of the round-trip time.
- If we do decide to always open the repo, we still shouldn't be reading pack-names twice (11, 13)
- Having to probe for .bzr/branch-format before probing for .bzr/branch/format also adds a fair amount of overhead.
So given our current data layout, and desire to always open the repository, I think we could shave off maybe 2 round trips.
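The round-trip arithmetic above can be checked with a short Python sketch. The request list is read off the trace, on the assumption that each request is issued on one row and answered on the next. Treating steps 7 through 13 as the repository-related requests is my reading of the analysis, not something the trace labels, and the helper names are made up for illustration.

```python
from collections import Counter

# (request, time issued, time answered), read off the -Dhttp trace above.
# The two connect() calls are left out; only HTTP requests are listed.
REQUESTS = [
    ("POST /repo/branch/.bzr/smart", 0.474, 0.884),
    ("GET /repo/branch/.bzr/branch-format", 1.101, 1.295),
    ("GET /repo/branch/.bzr/branch/format", 1.295, 1.498),
    ("GET /repo/branch/.bzr/repository/format", 1.498, 1.703),
    ("POST /repo/.bzr/smart", 1.703, 2.113),
    ("GET /repo/.bzr/branch-format", 2.113, 2.329),
    ("GET /repo/.bzr/repository/format", 2.329, 2.522),
    ("GET /repo/.bzr/repository/pack-names", 2.522, 2.728),
    ("HEAD /repo/.bzr/repository/shared-storage", 2.728, 2.933),
    ("GET /repo/.bzr/repository/pack-names", 2.933, 3.137),
    ("GET /repo/branch/.bzr/branch/last-revision", 3.137, 3.341),
]
TOTAL_SECONDS = 3.341  # cumulative time on the last row of the trace


def elapsed(requests):
    """Sum the time between issuing each request and seeing its response."""
    return sum(answered - issued for _, issued, answered in requests)


def duplicated(requests):
    """Return the requests that were sent more than once, in first-seen order."""
    counts = Counter(name for name, _, _ in requests)
    return [name for name, count in counts.items() if count > 1]


# Assumption: the requests issued on steps 7-13 are the ones spent
# locating and opening the repository.
repo_requests = REQUESTS[3:10]

print(len(repo_requests), "repository-related requests")
print("seconds spent on them: %.3f" % elapsed(repo_requests))
print("share of total: %.0f%%" % (100 * elapsed(repo_requests) / TOTAL_SECONDS))
print("requested more than once:", duplicated(REQUESTS))
```

Under that reading, the seven repository-related requests account for roughly 1.64 of the 3.34 seconds, and pack-names is the only request sent twice.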
If we are always going to open the repository connected to a branch, I think the smart protocol should be written such that doing Branch.open returns the information about the containing repository. It could even trivially always return Branch.last_revision_info as well as Repository.shared_storage, Repository.no_working_trees, etc. It would be nice if we could return repository-specific information, like pack-names, at the same time. It should also try to avoid checking for both .bzr/branch-format and .bzr/branch/format. The BzrDir information should also be obtained in the same single round trip.
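To make the proposal concrete, here is a minimal Python sketch of what a combined response might carry. Everything in it is hypothetical: the class names, the field names, the wire layout, and the in-memory stand-in for a server are illustrations only and are not part of the existing smart protocol. The 'yes'/'no' encoding simply mirrors the style of the results shown in the bzr+ssh trace in this document.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BranchOpenResult:
    """Hypothetical payload for one combined 'open branch' round trip."""

    branch_format: str
    last_revision_info: Tuple[int, str]  # (revno, revision_id)
    repository_path: str                 # relative to the branch, e.g. ".."
    shared_storage: bool
    no_working_trees: bool
    pack_names: Optional[bytes] = None   # repository-specific extra, if any

    def to_wire(self):
        """Flatten to a tuple of strings (hypothetical layout)."""
        revno, revision_id = self.last_revision_info
        return (
            "ok",
            self.branch_format,
            str(revno),
            revision_id,
            self.repository_path,
            "yes" if self.shared_storage else "no",
            "yes" if self.no_working_trees else "no",
        )


class CountingServer:
    """Toy in-memory stand-in for a server that counts round trips."""

    def __init__(self, result):
        self._result = result
        self.round_trips = 0

    def open_branch(self, path):
        # One call answers what currently takes several probes.
        self.round_trips += 1
        return self._result


# Placeholder values; none of these are real format strings or revisions.
result = BranchOpenResult(
    branch_format="example-branch-format",
    last_revision_info=(42, "example-revision-id"),
    repository_path="..",
    shared_storage=True,
    no_working_trees=False,
)
server = CountingServer(result)
info = server.open_branch("repo/branch")
print(info.to_wire())
print("round trips:", server.round_trips)
```

The point of the sketch is only that a client holding such a result would not need separate requests for the branch format, the repository location, the shared-storage flag, or the last revision.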
Comparison with bzr+ssh
Comparison with doing a similar operation on a local network, using bzr+ssh:
1 0.322 0.322 Setup, hpss call 'BzrDir.open repo/branch'
2 0.412 'ssh implementation is OpenSSH'
3 4.125 result ('yes'), 'BzrDir.open_branch'
4 4.135 result ('ok'), 'BzrDir.find_repositoryV2'
5 4.160 result ('ok', '..', 'no', 'no', 'no')
'BzrDir.open repo'
6 4.169 result ('yes'), 'BzrDir.find_repositoryV2'
7 4.180 result ('ok', '', 'no', 'no', 'no')
'Repository.is_shared'
8 4.211 result ('yes',)
'Branch.last_revision_info'
9 4.324 result ('ok', revno, revision_id)
bzr+ssh analysis
- We do a bit better, in that we don't open the pack-names files.
- However, you can still see that we probe for the BzrDir, and then for the branch object. Then we ask where the repository is, find out it is in the containing directory, and issue another find in the parent directory. (We might be doing better here when the repository is multiple levels up; I don't know whether it would do a single jump for ../.., or two like we do for plain http.)
BzrDir.find_repositoryV2 is smart enough to return the location of the repository, as well as extra meta-information about it. However, the specific extra information is a bit limited, and doesn't cover what we really care about. Specifically it returns:
path, rich_root, tree_ref, external_lookup = self._find(path)
These are probably still necessary, but it seems like it would be good to return the other simple booleans, like shared-storage.
- The overall time is certainly dominated by connecting and starting a remote bzr instance. Partly this is because my remote machine is a bit slow.
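How strongly connection setup dominates can be read directly off the bzr+ssh trace. The Python sketch below splits the total at the first row that logs a result. The function name is made up for illustration, and note that the time up to the first result also includes local startup and the first call's own round trip, so it is an upper bound on the connection cost.

```python
# (step, cumulative seconds) copied from the bzr+ssh trace above.
SSH_TRACE = [
    (1, 0.322), (2, 0.412), (3, 4.125), (4, 4.135), (5, 4.160),
    (6, 4.169), (7, 4.180), (8, 4.211), (9, 4.324),
]
FIRST_RESULT_STEP = 3  # first row that logs a result from the remote side


def split_at_first_result(trace, first_result_step):
    """Return (seconds until the first result, seconds for everything after)."""
    times = dict(trace)
    total = trace[-1][1]
    until_first = times[first_result_step]
    return until_first, total - until_first


until_first, remainder = split_at_first_result(SSH_TRACE, FIRST_RESULT_STEP)
# Steps 4 through 9 each log one further result, so six calls follow.
calls_after = len(SSH_TRACE) - FIRST_RESULT_STEP

print("until first result: %.3f s" % until_first)
print("remaining %d calls: %.3f s" % (calls_after, remainder))
print("share before first result: %.0f%%" % (100 * until_first / SSH_TRACE[-1][1]))
```

On these numbers, about 4.1 of the 4.3 seconds pass before the first result arrives, and the six calls that follow take about 0.2 seconds in total.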
