You attribute "purpose" as though the algorithm were aware of any such "purpose". It is not. It is only minimizing one or more loss functions. The idea that it would learn its own grammar has to do with the reward the AI gets whenever it gets a response from a "user". If it ends up learning by communicating with itself (i.e. it is no longer receiving the input vector that indicates whether the source is organic or not), then it will start to "develop" its own grammar and syntax. Whatever most readily gets a response from itself will be learned from and propagated further.
Hence the importance of removing bot responses from the dataset. This has been well studied and is a major part of what data brokers do to increase the value of their data offerings.
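To make the filtering step concrete, here is a rough sketch in Python. It assumes each record already carries a flag saying whether a human wrote it. The `Message` class, the `is_organic` field, and `keep_organic` are names I made up for illustration; real pipelines have to infer that flag somehow, which is the hard part and is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    is_organic: bool  # hypothetical flag: True if a human wrote the message


def keep_organic(messages):
    """Return only the messages flagged as human-written."""
    return [m for m in messages if m.is_organic]


# Toy dataset: two human messages and one bot-generated one.
dataset = [
    Message("how do I reset my password?", True),
    Message("zx qq zx qq", False),
    Message("thanks, that worked", True),
]

# Only the organic messages would be passed on to training.
clean = keep_organic(dataset)
```

If the flag is missing or wrong, the bot text stays in the training data, and that is the feedback loop described above.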