
RE: real world examples




> because of the short transactions working on the same data:
>
> 1) the checkExclusion() mutex is stressed
>
> As Daniel pointed out some time ago my checkExclusion() mutex is
> not entirely correct.
>
> I see many check/waitExclusion messages in your logs but nothing
> proves that this actually causes problems.
>
> 2) deadlocks occur
>
> your logs say that the server did detect deadlocks. In the past I had some
> problems when restarting deadlocked transactions. However,
> nothing actually proves this either.
>
> But...
>
> The third log shows that after some operation the cluster store
> wasn't able to store (rename) a cluster file. This of course causes
> the NPE since the cluster is internally recorded but not actually
> there.

Ah. Now, I develop on Windows 2000. There was that issue a month or two back
in v0.6.1 where you were using file.renameTo(File), which was causing
problems on win32 machines. File.renameTo(File) will overwrite an existing
file on *NIX boxes and return true, but will fail and return false on win32.

A quick grep shows that calls to file.renameTo(File) are still in:

	classicstore.Cluster.endRecovery(byte),
	wizardStore.Cluster.restoreShadow(),
	wizardStore.ClusterStore.restoreCluster(ClusterID),
	wizardStore.ClusterStore.commitCluster(Transaction, ClusterID)

I should have checked that when the last bug came up.

The fix applied to the last one was to delete the file you're moving to
first, although this leaves you open to some nasty data loss issues if your
system dies at just the right time. Better would be to move something off to
a temporary holding file, but this would slow things down quite
considerably, I suspect.
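
To make the "temporary holding file" idea concrete, here is a minimal sketch in plain java.io. It is not Ozone code: the class name SafeRename, the method replaceFile, and the ".bak" suffix are all made up for illustration. The old target is moved aside rather than deleted, so a crash between the two renames still leaves the old data on disk under the backup name.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper, not part of Ozone. */
public class SafeRename {

    /**
     * Moves src onto dst without deleting dst up front. The old dst is
     * parked in a holding file first and only removed once src is in place.
     */
    public static void replaceFile(File src, File dst) throws IOException {
        // Assumed naming scheme for the holding file.
        File backup = new File(dst.getPath() + ".bak");

        if (dst.exists()) {
            // A backup left over while dst still exists is stale.
            if (backup.exists() && !backup.delete()) {
                throw new IOException("cannot remove stale backup " + backup);
            }
            if (!dst.renameTo(backup)) {
                throw new IOException("cannot move " + dst + " aside");
            }
        }

        if (!src.renameTo(dst)) {
            // Put the old file back so the store still sees its data.
            if (backup.exists() && !backup.renameTo(dst)) {
                throw new IOException("rename failed; old data kept in " + backup);
            }
            throw new IOException("cannot rename " + src + " to " + dst);
        }

        // Best effort: a leftover backup is harmless and is cleaned up next time.
        backup.delete();
    }
}
```

This costs one extra rename per commit rather than a copy, so the slowdown may be smaller than feared, though that would need measuring on a real store.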

[I just went and looked in the 1.0.1 code and it seems that you changed
that fix to something better involving the old fallback of byte I/O in
wizardStore.Cluster.saveShadow(). Cool. In restoreShadow(), it still just
deletes the target file first. In
wizardStore.ClusterStore.commitCluster(Transaction, ClusterID), the target
file is also deleted first. However, in
wizardStore.ClusterStore.restoreCluster(ClusterID), this doesn't appear to
be done. Even if this isn't the culprit, the code should probably be
amended.]
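
For reference, a rename with a byte I/O fallback might look roughly like the sketch below. This is my own illustration and not the actual saveShadow() code: the class name RenameOrCopy and both method names are invented. It tries renameTo() first and, if that returns false (as it does on win32 when the target exists), copies the bytes over the target and then removes the source.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper, not part of Ozone. */
public class RenameOrCopy {

    /** Moves src to dst, falling back to a byte copy if renameTo() fails. */
    public static void move(File src, File dst) throws IOException {
        if (src.renameTo(dst)) {
            return; // the cheap path worked
        }
        copyBytes(src, dst);
        if (!src.delete()) {
            throw new IOException("copied but could not delete " + src);
        }
    }

    /** Plain stream copy; FileOutputStream truncates an existing dst. */
    public static void copyBytes(File src, File dst) throws IOException {
        try (InputStream in = new FileInputStream(src);
             OutputStream out = new FileOutputStream(dst)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```

Note that the copy path is not atomic either: a crash halfway through leaves a truncated target, so it trades the rename failure for a different, smaller window of risk.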

So is this the problem, or is it something else, do you think?

> > So it looks like deadlocking and race conditions occur every time; if
> > those more familiar with Ozone than I could take a look and make some
> > educated guesses? I've been looking at the internals of Ozone that deal
> > with this stuff and haven't made much headway in comprehension.
>
> We should look into the 3 points above at first.

Well, an educated guess applied to point #3. The first two points are a bit
over my head right now; perhaps you could start by explaining why your mutex
code is not quite correct?

> Reason, your TwilightMinds package looks interesting. I would
> like to check it and do some testing by myself. Are there any points to
> pay attention to when installing?

Hmm. Hopefully not; I try to write good docs. :) The only part of the
package that really pertains directly to Ozone is the Data section.

I just had a rather embarrassing bug in the 0.45 version pointed out to me
yesterday -- deleting indexed Indexable objects will always throw a null
pointer exception at the moment. I'm in the middle of a fairly large update,
so I don't plan to post that fix until I'm done with that.

The only other thing that springs to mind is that you can't nest external
transactions in 0.45 -- but then, you'll expect that. I've fixed that in the
code I'm in the middle of working on.

(I really need to get the SourceForge project going for CVS access, but I
just haven't had the time).

> BTW: I would also like to include the TwilightMinds package in the
> ozone-db.org projects page. Is this ok?

Certainly, by all means.

Reason
http://www.exratio.com/