This is my current view of Vertical Information Retrieval & Large Scale Search Engines (NB. they are somewhat different!)
My Background
I first learned about local co-occurrence and traversing sorted & inverted indices as a kid at Brighton Grammar School whilst being shown how to use a proper multi-part thesaurus in the 90s, well after I had already started coding algorithms in C. So, I have been doing this Information Retrieval thing for quite a while.
In my teens I had my account banned for two weeks at RMIT University for doing too much #InfoRec experiments on Minyos. At the time everyone else was getting their accounts banned for SunOS 4.1 haxing in the Xterm labs on Yallara -- it was quite a funny situation.
Since then I've done the full postgraduate degree at RMIT on #InfoRec under the watchful eye of Professor Balbin, back when it was at its absolute peak and it was time to upgrade the world's fastest sorting/searching/filesystem algorithm (Burst Tries/HAT-Tries) again. I was yelling down the halls to Ranjan about it when he was working on it - the whole school knew what was going on, and Ranjan and I would talk regularly about the incremental improvements. I even did my own Bayesian version on the side called p̂-Tries which is 2x faster again (and which you may see someday if I feel like the world needs it).
I might have also seen some of the #InfoRec course from Stanford while I was in #SV, but that's for us to know and you to never find out ;) Their course was more focused on Search Engines than Information Retrieval, so it's nice when one gets exposed to both views.
Anyways, I do know Information Retrieval, and Search Engines, and #InfoRec; and my credentials are well established. So, this is what I currently think about #InfoRec.
Re: The ACM SIGIR Conference happening in Melbourne
I do have much to say about the NeuroPhysIIR Workshop at ACM SIGIR CHIIR 2025 conference at RMIT in Melbourne - none of the big researchers from Melbourne #InfoRec or RMIT CS&IT (RIP) are going to be there. If we aren't there is it a real conference? Is 'RMIT Computing' even a valid CS school, atm?
I certainly don't think the topics are appropriate for the current era, nor do I think it's safe for #InfoRec practitioners to be in Melbourne right now.
Please see this LinkedIn Post for more info from me on the matter.
I am unable to attend because I am trapped in my home bonded in full bondage in indentured servitude by the local police, but attendees are welcome to contact me during the conference:
- I'm @CompSciFutures.888 on Signal Messenger or
- ap@andrewprendergast.com if you don't care about opsec
as long as it's about #InfoRec algorithmics and not our foreign interference problem.
I am safe in the short term, but I am a little bit stuck.
As in locked in a cage stuck,
possibly in domestic + cyber imprisonment for 20 years without being told?
Anyways, I would enjoy a good chat about #InfoRec.
But don't use the normal telephony network,
because it will probably get blocked.
Edit 24-Mar-25:
I received a note from Falk (who is awsm). I don't think it changes anything I've said here, but I'm doing the right thing and posting it. I'm yet to get confirmation of attendance by researchers of significance from Melbourne InfoRec (NB. IR = InfoRec):
Thanks for your message. You’ll be happy to know that there are in fact many senior researchers, both local and international, at the CHIIR conference today 😉 And just to clarify, you seem to be conflating the ACM SIGIR Conference, the ACM CHIIR conference, and the NeuroPhysIIR workshop in your comments — these are three different things, and not the same. (Just highlighting this since I'm sure you want to convey correct information via your blog!) Regards, Falk. —— Falk Scholer Professor of Information Access and Retrieval Technologies RMIT University, Melbourne, Australia
"The Ode to 10 Blue Links":
Having worked on more than a couple large major search engines in my life (in Silicon Valley, e.g., Google & AltaVista), and having helped deal with the '98% generative' problem more than one time, I do know a thing or two about Information Retrieval, including how to do vertical search and 'Search Without a Search Box' with all the magic of Google First Page Results but without the PageRank Ergodic Markov Chain bit.
To that end, here is a little ditty, a little something I wrote about #InfoRec and ACM TREC competitions:
I am telling you this Just so that you know. We spiked the crap out of The Internets, long long ago. What you think is secret, We already know. Through your little input box, We can find out anything There is to know. About your darkest secrets, all of you. LLMs, search engines, Retrieval systems. Anything with an inverted index, Our tricks work better than you know. We are only interested in Relative metrics, Not causing a Punch and Judy show. Just look after your users, And we will go with the flow. definitely do not psyop, and we let almost anything else go. Do you think we want to adjudicate, The competition blow by blow? Through that little search box, We can see well into your house, Every detailed piece of execution flow. But we also know how to keep secret, Whatever there is to know. #InfoRec only talks to InfoRec When there is something Important to know. Like who is able to handle Case sensitive nouns! That’s something that we will show, To our inner circle of inforec friends -- The rest of the world is not to know. We really do love indices & InfoRec And are much much more nerdy About it. Thank you could ever possibly Know. We do know what we are getting into, how much governments want the SQLs and SQRs, those 'innermost thoughts and desires - fleeting', is what governments want to know. We do know what proprietary information is, we don't ever inappropriately tell what we know. We always do the right thing, and stay well out of the punch and judy show. But please pay us properly, AND leave us, our families, our algorithms and our universities, well well alone. THEN we really really will, just go with the flow. The very very best of our algorithms, you are yet to know. And please never steal from us, or the curse of 1000 years, will soon flow. And please do not tarnish, or interrupt user flow. We are telling you all of this, because we do still know: Users really do like to love, the operators of the day of their 10 blue links.
From #InfoRec, in the style of Dr. Seuss by Dr. Loose! Or is it? Ya know?
Contemporary Practical Application of #InfoRec:
And here is how it is done - with the generalised form of the Information Retrieval Equation: 50% classical, 50% probabilistic. Just add Causality and Information Theory, for first page result like performance, just for fun:
Vizicks UI Humanizing Data Exemplar:
Here is an #InfoRec system that uses
- Humanizing Data vizualisation philosophy,
- lots of NP-hard 'prune as you go' estimation + probabilistic A* traversal to throw away noisy datapoints,
- Search Without a Search Box,
- The Datapoint You Need to Know Right Now,
- real AI as an Expert System that includes Probabilistic Logic with Causality So Dense, it Almost Rips Holes in Walls,
- When a Metric Goes Critical and
- runs every possible subcube through the above Information Retrieval Equation for ranking:
ON VIZUALISATION SUITES
— Ä̷̬͖̽͗P̷̭̳͔͇̊ on CompSciFutures (∀/acc) (@CompSciFutures) October 18, 2024
Having done ALOT of #dataviz, the optimal & most #humane setup for most purposes (i.e. immersion not required) is 3 connected touch screens that each have a different view into the same current datapoint.
Size, color, shape, animation & no numbers matters. pic.twitter.com/ufIILIu8p2
You do need that 'Causality So Dense ...' bit in your expert system embedded in the P(relevance|x1,...,xn) piece to give it that uniquely Google First Page Results feel so that it always tells you about the datapoint you need to know about right now. For that, you are going to need a damn good Bayesian Analyst, otherwise it'll feel quite average.
Here is the showreel:
I built this for Australia Post. It was quite amazing. The socials/NPS stuff was a bit average/off (because it's hard to get the data without doing single customer view so the requirements are a bit of a moving target), but the EBIT related & paid advertising stuff was mind blowingly good. Forget data hygiene, just #InfoRec and vizualise your way through whatever you've got at the time seemed to be the outcome! It was so good, one very senior advertising executive from AdLand was on their knees begging to have it for themselves.
On Social and Economic Networks
Something to consider if you are doing non-law of large numbers, 'we live in a long tail universe' social/economic graph analysis:
𝗪.𝗥.𝗧. 𝗦𝗢𝗖𝗜𝗔𝗟 𝗚𝗥𝗔𝗣𝗛 𝗫𝗙𝗢𝗥𝗠𝗦
— Ä̷̬͖̽͗P̷̭̳͔͇̊ on CompSciFutures (∀/acc) (@CompSciFutures) March 23, 2025
If your transformation over a social graph loses fidelity, expressiveness or parsimony as you pass your representation through a #flowers 𝒫 first, then it's probably useless - or it's very usefully mid data.
Do you know?
.\ 𝒫 https://t.co/BjdSeqqOuz
On Search Query Logs
And just a reminder, on the topic of search queries (i.e., what we call SQLs & SQRs), they do contain your 'innermost thoughts and desires - fleeting'. Do we really need to keep them anymore? Why can't we absorb them into a properly privacy preserving generative model with no user IDs, then throw the logs out? That is all we need, and it will go faster!
Those SQLs & SQRs logs are the main thing governments keep targeting search engines for. If we don't have them - they should leave us alone. It's extremely expensive to run a system that keeps those logs anyway. Every government wants them, and they do end up getting them, by force if they have to. These days, their spooks, their cyber espionage and their vendor backdoors are just too hard to fight against AND get anything else done, hence why these SQLs and SQRs are now so damn expensive to keep.
We also can't ensure any guarantees that they will use them ethically in their analyses (e.g., by using loess regression) or ensure adherence to basic controls like data privacy or destruction. And then there is the matter of #InfoRec computer scientists suddenly losing family members in diabolical but oddly consistent ways -- I know a few myself personally, and have lost a father to this problem. Do we really want to be involved in that anymore?
To that end, see my Urban Dictionary entry on the topic! :)
Once upon a time Google was filled with people of the calibre that they simply wouldn't see user data when we looked at systems and we could stop governments from getting them. To us, it just wasn't there. When we see user data, what is in it, we really don't care. I really do love doing #InfoRec, and I do care about serving our users and looking after their data - always. In the most privacy preserving of ways. I say we get rid of the Search Query Logs at least - that is the one they REALLY want. Anyways, this is as close as I would ever get to them:
If I could work on anything I want to, it would be vertical search. I am so damn #InfoRec, I can’t stop making weird quirky inverted indices + my AI + visualisation.
— Ä̷̬͖̽͗P̷̭̳͔͇̊ on CompSciFutures (∀/acc) (@CompSciFutures) March 21, 2025
I’d probably be running the “yellow box” department at #HeyGoogle @Google by now.
␄https://t.co/vAJbHu2ftx
Banks aggregate their data at the merchant transaction and not the product level for similar reasons -- we should follow their lead. Product level transactions are even worse because they aren't 'fleeting', they are 'with conviction'.
The next Turing Award should go to 'Melbourne InfoRec'
Being identified as a significant contributor to Information Retrieval can be quite treachrous these days, and many of us have not been paid at all for our work, ranging from the fastest and most amazing algorithms in the world, to all of that work it takes to run ACM TREC competitions. We can't even safely have #InfoRec group lunches anymore.
Just to run a TREC competition requires a full university sized CS lab, filled with #InfoRec PhDs or equivalent + trained in everything from linguistics and lexicography to all the other things we need to know, including a bit of computational neuroscience here and there! Then we have a 48 hour long pizza party in the lab where we carefully class label documents, then send off the eval & held out test sets to the search engines so they can run the results. And then AFTER the competition, some more sneaky InfoRec PhD level brainiacs come in and sometimes do qualitative or quantitative forensic Information Retrieval to re-run the metrics ourselves and find out whats going on, e.g. by comparing held out test data. It is a rather large effort for all involved.
To have Search Engines not take the last one seriously is frankly, offensive. I think the last competition is possibly the last.
We really have contributed a lot and very few of us have ever been paid. There is not a chance in hell that we will ever provide a list of names. Just the existence of such a list is gravely dangerous.
To that end, I think The Association For Computing Machinery (ACM) should award the next ACM Turing Award to Melbourne InfoRec, just for being awesome. Rachel Griffiths and Cyrus should accept it on behalf of all of us. Cyrus did get his mum killed by CIA and his character smeared for detecting the inforec problem in AltaVista. The rest of us would like to stay unidentified thanks.
The award can go into the Gold Lions cupboard at Clemenger BBDO and I think Clems knows why; they do have a few job numbers that were opened up because ACM referred them to information retrievalists in the past, and they did do quite a smashing job in record time, as usual.
Long live Melbourne InfoRec!
I do love #InfoRec. And advertising. And traffic. And vizualisation. Separately. And together.
But the one thing I yearn for the most for isn't a Nobel, a Fields or a Turing: It's a *hushed tone* Gold Lion.
That's the one you really want. I know I do!
That and one free BBQ per week!! Long live RMIT CS&IT. Long live Melbourne #InfoRec.
.\𝒫
Banking clearance last updated 2023.
PS. Follow @CompSciFutures on 𝕏 to keep up-to-date on the wildest of computer science wizardry. Computer science of the ages for the ages from the ages. For ages. And more. Pass it on.