A topic that comes up fairly often on the forum (and external blogs) is how much it is possible to improve over an extended period of problem solving. Some believe it is very uncommon to improve after the first 4000-5000 problems. Some believe most people (or rather most adults) plateau in the 1000-4000 problem range, with some initial improvement and none after that. In the past this was an easy impression to form, as the rating graphs of very high volume solvers often showed long plateaus or even declines. However, it was clear that duplicate reward reduction was a big factor in these cases. The fact that duplicate reward reduction started to become a serious issue after around the 4000 problem mark, when duplicates become much more common, is certainly no coincidence. Essentially, reward reduction was masking improvement in many users. With some users seeing the same problem many times, and therefore receiving very little reward for correct answers but full punishment for incorrect responses, it is no wonder some of these high volume users were having trouble increasing their rating, especially if they had poor explicit memory of the solutions to problems they had seen before.
Up to now I'd only looked at a few isolated cases which seemed to contradict the 'all adults plateau early' opinion, but hadn't had time to do a more in-depth analysis. Recently I had time for a somewhat deeper analysis. I looked at all blitz solvers who had done more than 30,000 blitz problems (and fewer than 100,000, as the data processing for the small number of users with 100K+ attempts was taking too long to complete). Blitz was chosen as it does a good job of factoring out the issue that standard rating can be improved by taking longer without necessarily increasing your skill level. For similar reasons, I decided to focus only on problems that were either incorrect, or correct within the time limit for gaining points; problems that were correct but lost rating points, or that took more than 5 minutes to solve, were not analysed. The 5 minute cut-off was used because that is the internal cut-off for solve times in blitz, with all longer times truncated, so excluding these avoids having to account for attempts where the user is essentially thinking at their leisure. From this set, all duplicate problem attempts were removed, leaving only the set of problems the user had seen for the first time on Chesstempo (i.e. if a problem had been seen first in a mode other than blitz (custom sets, standard etc.), it was excluded, even if it was seen only once in actual blitz mode).
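For those curious how these filtering rules combine, here is a minimal sketch in Python, assuming a simple list-of-dicts representation of attempts. The field names (`problem_id`, `mode`, `time`, `correct`, `rating_change`) are hypothetical and not Chesstempo's actual schema; the real analysis was done in SQL.

```python
BLITZ_TIME_CAP = 300  # seconds; the internal 5-minute cut-off for blitz solve times

def filter_attempts(attempts):
    """Keep only first-time blitz attempts that were either incorrect,
    or correct within the point-scoring time limit."""
    seen = set()
    kept = []
    for a in sorted(attempts, key=lambda a: a["date"]):
        pid = a["problem_id"]
        first_time = pid not in seen
        seen.add(pid)
        if not first_time:
            continue  # duplicate: the user has seen this problem before
        if a["mode"] != "blitz":
            continue  # first sighting was outside blitz, so excluded entirely
        if a["time"] >= BLITZ_TIME_CAP:
            continue  # truncated time: user essentially thinking at leisure
        if a["correct"] and a["rating_change"] < 0:
            continue  # correct, but too slow to gain rating points
        kept.append(a)
    return kept
```

Note that a problem first seen outside blitz is still added to `seen`, so any later blitz attempt on it is treated as a duplicate and excluded, matching the rule described above.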
For each user, their set of matching problems was then sorted by date solved - from earliest to latest - and I then looked at the performance rating across several intervals. First, the initial 500 of these non-duplicate attempts was given a performance rating, then the attempts from 4000-4500, then the attempts from 10,000-10,500, and finally the last 200 attempts. This gives four performance ratings: an initial level, a level after the first 4000 and 10,000 non-duplicate attempts, and then the current performance level as indicated by the last 200 non-duplicate attempts.
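The slice selection can be sketched as follows, assuming the attempts have already been filtered down to non-duplicates as described, and that `date` is a hypothetical sortable field:

```python
def performance_slices(attempts):
    """Return the four attempt windows compared in the analysis:
    the initial 500, attempts 4000-4500, attempts 10,000-10,500,
    and the final 200, in date order from earliest to latest."""
    ordered = sorted(attempts, key=lambda a: a["date"])
    return {
        "initial_500": ordered[:500],
        "slice_4000": ordered[4000:4500],
        "slice_10000": ordered[10000:10500],
        "final_200": ordered[-200:],
    }
```

Each window is then given its own performance rating, so the windows can be compared against each other for the same user.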
Note that because of the high volume solving of this set of users, the 10,000-10,500 non-duplicate attempts were far more than 10,000 overall attempts into each user's solving history, due to the number of duplicates users are getting at that stage. For example, some people may not reach 10,000 non-duplicates until after solving over 30,000 problems, depending on how many problems are available in their rating range. So to get to 10,000 non-duplicates, most users have solved a very significant number of problems, and are well beyond the 'plateau at 4000' range.
The performance rating formula used is the 'algorithm of 400' described here:
http://en.wikipedia.org/wiki/Elo_rating_system#Performance_rating
This formula was used rather than trying to calculate glicko across all problems for convenience reasons - I already had an SQL (database query) implementation of the '400' algorithm, and developing an SQL query that performed glicko within the query was more work than I wanted to do right now, and I didn't see a strong reason why it would produce significantly more meaningful results.
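For reference, the 'algorithm of 400' is straightforward to implement outside SQL as well. A minimal Python sketch, treating each solved problem as a win against an opponent rated at the problem's rating and each failure as a loss, might look like:

```python
def performance_rating_400(results):
    """'Algorithm of 400' performance rating.

    `results` is a list of (problem_rating, solved) pairs; a solved
    problem counts as a win, a failed one as a loss."""
    if not results:
        raise ValueError("need at least one attempt")
    total_opponent_rating = sum(rating for rating, _ in results)
    wins = sum(1 for _, solved in results if solved)
    losses = len(results) - wins
    # Performance = (sum of opponents' ratings + 400 * (wins - losses)) / games
    return (total_opponent_rating + 400 * (wins - losses)) / len(results)
```

For example, solving a 1600-rated problem and failing a 1400-rated one gives (1600 + 1400 + 400 × (1 − 1)) / 2 = 1500.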
To be able to include data from attempts at or beyond 10,000 non-duplicates, users who had over 30,000 total attempts but under 10,500 non-duplicate attempts were excluded. This reduced the available analysable users from 116 down to 77. Users with a minimum rating below 600 were also excluded. The 600 exclusion was done to exclude users like this:
http://chesstempo.com/chess-statistics/torosentado
who deliberately got hundreds of problems wrong in a sequence in order to artificially drop their rating.
To stop rating drift over time from impacting the results, the rating at the time the problem was attempted was ignored in the performance rating calculations, and the current rating of the problem was used instead (attempts on disabled problems were also ignored, as these problems don't have up-to-date current ratings). This means all users are compared on the same basis for each problem, irrespective of when their original attempt was and what the rating of the problem was at the time. This is especially important for these high volume solvers, as they often get served new problems very quickly, before the problem's rating has had time to settle; using the current rating avoids this issue as well.
Unsurprisingly, almost all of those 77 users improved from their initial 500 attempt performance rating. 93% of the users improved from their initial 500 attempt performance to their 4000-4500 performance sample. The average rating improvement during that time was 88 rating points. At this level, 4000 non-duplicates probably equates to around 5000 total duplicate+non-duplicate attempts for most people, as duplicates at that early stage are not yet a large percentage of problems served.
Now, the conjecture is that by this stage it is becoming impossible for adult solvers to improve further; however, the data does not seem to support that. The rate of average improvement does begin to slow, and diminishing returns are certainly starting to become a factor, but from the 4000-4500 slice to the 10,000-10,500 slice, 82% of solvers still improved, with the average improvement sitting at 48 performance rating points. Note that this average includes the decline of the 18% of non-improvers who were equal to or worse than their 4000-4500 slice. Jumps over 200 points were seen in this range, and jumps over 100 were not uncommon (sorry, no standard deviation or median data at this point).
The final comparison was between the final 200 non-duplicate attempts for each user and their 4000-4500 level. Here 87% of solvers had improved in their most recent problem performance over their 4000-4500 performance, with an average improvement of 84 points (which is on top of the 88 rating point improvement the average user had already made after their first 500 attempts). This indicates that not only do people continue to improve from 4000 non-duplicates to 10,000 non-duplicates, but they apparently continue improving beyond 10,000 attempts, with both a larger percentage of improvers and a larger average improvement compared to the 4000 to 10,000 comparison.
The 10,000-to-final improvements are fairly modest at an average of 36 points. This is partly due to further plateauing, but also partly because quite a few people had total non-duplicate counts clustered quite close to 10,000, leaving a smaller time window for improvement. For those who had over 15,000 non-duplicate attempts (25 people in total), the average 10,000-to-final gap was just under 50 performance rating points.
There is still the issue of how relevant all of this is to older solvers. While I don't have age data for all these users, I do have FIDE year of birth data for the users who had entered their FIDE id in their preferences. Unfortunately this was only 5 people in this data set. Their average age was 51. Cutting down from 116 to 77 users by excluding those with under 10,500 non-duplicate attempts removed 1 non-adult, but several other excluded people were in the 30+ bracket. While it is a very small sample size, based on personal knowledge of those on the forum, and the supporting data of the FIDE ages, I think it is fair to say the majority of the users in the 77 person sample are adults, many of them quite old. 3 of the 5 over-40-year-olds improved from their 4000 non-duplicate attempt performance level to their final 200 performance, with an average improvement of 45 points for the three improvers. It is obviously hard to draw conclusions from a sample of 5, and I think the much larger sample of 77 is probably a better indicator, but either way, improvement for adults after a large number of attempts seems far from impossible based on all the data I've seen, even if the average improvement is relatively modest.
In summary, users in this population on average improve their performance rating from the initial 500 problems to the 4000 problem mark by around 88 points. They then almost match this when going from the 4000 mark to their current level. Note that this is definitely sub-linear progress, as the number of problem attempts per non-duplicate usually rises steadily as the total number of problems rises (although this can be partly mitigated by an increasing rating, which provides access to unsolved higher rated problems). This indicates decent improvement can still be gained after the first 4000-5000 problem attempts, although it does require a fair bit of effort. The average performance ratings across this group for the initial 500, 4000-4500, 10,000-10,500 and final slices were 1571, 1660, 1708 and 1743 respectively.
Given many of these users are probably using far from optimal learning/training strategies, I think it is safe to treat these as a lower bound on average improvement, with better improvement likely under optimal learning methods.
Regards,
Richard.