18 Comments
wess trabelsi:

I remember how I felt when I tried o3 after learning that it had "achieved an ELO score of 2727 on Codeforces, a competitive programming platform, surpassing OpenAI's Chief Scientist's score of 2665," only to watch it fail miserably on a pretty dumb Google Apps Script project I had in mind...

I recently watched a recorded webinar my supervisors sent me. It was meant for administrators, and the instructor was making very bold claims, such as "it's great for scheduling." I totally called BS on that. Thanks for confirming; I didn't even bother trying.

wess trabelsi:

ACTUALLY - I got access to Manus.im and paid after I quickly ran out of free tokens, and I tried a scheduling task. I asked it to schedule 20 students for one-time presentations to 10 teachers within two weeks, with each presentation occurring during a teacher prep period and no teacher hosting more than 3 presentations total. I gave it the class list as a CSV and the teacher prep schedule as a CSV: each teacher has two prep periods a day, but all the teachers have different prep times.

Not only did Manus nail it (after a first glitch), it also built me a website on the spot where I can view the results per day, per teacher, etc. Check it out: https://aemnybne.manus.space/
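
For reference, the computation itself is small once it's written as code rather than reasoned out in prose. Here's a minimal greedy sketch of the same assignment in Python; the file names and column headers (students.csv with a "name" column, prep_schedule.csv with "teacher", "day", and "period" columns) are assumptions for illustration, not the actual files I uploaded.

```python
import csv
from collections import defaultdict

MAX_PER_TEACHER = 3  # no teacher hosts more than 3 presentations

# Assumed layout: students.csv has a "name" column; prep_schedule.csv
# has "teacher", "day", "period" columns covering the two-week window.
with open("students.csv", newline="") as f:
    students = [row["name"] for row in csv.DictReader(f)]

with open("prep_schedule.csv", newline="") as f:
    prep_slots = [(row["teacher"], row["day"], row["period"])
                  for row in csv.DictReader(f)]

load = defaultdict(int)   # presentations assigned to each teacher so far
used = set()              # (teacher, day, period) slots already taken
schedule = []

for student in students:
    for teacher, day, period in prep_slots:
        slot = (teacher, day, period)
        if load[teacher] < MAX_PER_TEACHER and slot not in used:
            load[teacher] += 1
            used.add(slot)
            schedule.append((student, *slot))
            break
    else:
        print(f"No slot found for {student}")  # greedy search can dead-end

for student, teacher, day, period in schedule:
    print(f"{student}: {teacher}, {day}, period {period}")
```

A problem this size (20 students against a couple hundred prep slots) is easy for a greedy pass; the interesting part is that the agent chooses to generate and run something like this instead of guessing at a schedule token by token.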

Nick Potkalitsky:

Good to know. I just got access last week but haven't played around with it. The more I experiment, the more I think it comes down to materials formatting and prompting. The capability seems to be latent in this generation of models, but different models require slightly different activations on the user's end.

wess trabelsi:

It's really about writing and executing code so that it can compute your task instead of relying on LLM inference alone. These new tools essentially design for themselves the tools they need to meet your demands.

Martijn Sengers:

Hi Nick, great article and experiment. I wonder what would have happened if you had primed the model with several examples (RAG, for instance). I fully agree with your observations, but I have managed to get the model to produce better solutions by teaching it that simple questions give simple answers, answers that look good or impressive to outsiders. If it looks like shit, it is shit (Boogie Nights). Give it context and you will be surprised 😉

Nick Potkalitsky:

Thanks. I just reworked my prompting pathway and took things a little slower. This time I asked GPT-4 to teach me how to prompt it for success. Results were much, much better!!!!

Tales Fernandes Costa:

Hi Nick, very good article. The complexity of scheduling problems rises very quickly with the number of variables and restrictions applied. In these cases, the effort to solve them is much higher than for straightforward organizational tasks. Another issue is that LLMs don't behave like conventional functions; sometimes they will give different answers for the same input prompts and parameters. Here is a link that may help future efforts: https://timefold.ai/
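
To make that concrete, here is a rough sketch of how a dedicated solver handles this kind of problem. It uses Google OR-Tools CP-SAT rather than Timefold (I won't vouch for Timefold's API here), and the students, teachers, and prep slots are toy placeholders rather than real data.

```python
from ortools.sat.python import cp_model

# Toy stand-ins for the real rosters and prep schedules.
students = [f"student_{i}" for i in range(20)]
teachers = [f"teacher_{t}" for t in range(10)]
# Assume each teacher has 2 prep periods per day over 10 school days.
prep_slots = {t: [(d, p) for d in range(10) for p in (0, 1)] for t in teachers}

model = cp_model.CpModel()

# x[s, t, d, p] == 1 iff student s presents to teacher t on day d, period p.
x = {(s, t, d, p): model.NewBoolVar(f"{s}_{t}_{d}_{p}")
     for s in students for t in teachers for d, p in prep_slots[t]}

# Each student presents exactly once.
for s in students:
    model.Add(sum(x[s, t, d, p] for t in teachers for d, p in prep_slots[t]) == 1)

for t in teachers:
    # No teacher hosts more than 3 presentations in total...
    model.Add(sum(x[s, t, d, p] for s in students for d, p in prep_slots[t]) <= 3)
    # ...and at most one presentation per prep slot.
    for d, p in prep_slots[t]:
        model.Add(sum(x[s, t, d, p] for s in students) <= 1)

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for (s, t, d, p), var in x.items():
        if solver.Value(var):
            print(f"{s} -> {t}, day {d}, period {p}")
```

Unlike an LLM pass, the same inputs produce the same schedule every run, and if the constraints can't be satisfied the solver says so instead of quietly returning something broken.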

Nick Potkalitsky:

Very cool, Tales. I will check this out. I was actually hoping to tap into a little of that unconventionality, in light of the complexity of the task at hand. I tried again with a much more clearly delineated prompt cycle and had better results.

Mark Laurence:

I wonder if you'd have gotten better results using Deep Research, Nick? I've found it does great things with uploaded data that I would otherwise have used the o-series models for. I can't guarantee you'd get a better result, but it's definitely worth trying.

Nick Potkalitsky:

Good advice. I was hoping someone would help me troubleshoot.

Nick Potkalitsky:

I actually just downgraded to GPT-4. Had better results. Lol!!!

Michael Woudenberg:

For how good it is at many things, it's terrible at many others. Which is fine! Because we haven't even figured out what to do with what we have!

Terry Underwood:

From Google:

Fallback Mechanisms:

Purpose: If an LLM call fails or encounters an issue, have alternative paths or actions in place.

Examples:
* Retry with a different LLM provider: if one LLM fails, try another.
* Use a simpler prompt: if the original prompt is too complex, simplify it.
* Route the request to a human: if the LLM cannot handle the task, delegate it to a human operator.
* Provide a default response: if the LLM fails to generate a response, provide a predefined default response.

From Gemini (Implementation):

Optimization:
* If the LLM's solution is not optimal, consider using optimization algorithms (e.g., genetic algorithms, constraint programming) to further refine the schedule.
* These algorithms can be integrated with the LLM's output to improve efficiency and fairness.

Implementation:
* Integrate the generated schedule into a scheduling system or database.
* Communicate the schedule to students and teachers.
* Create a system for feedback and change requests.

Important Considerations:
* LLM Capabilities: LLMs are powerful, but they may not always produce perfectly optimal solutions. Complex scheduling problems may require additional optimization techniques.
* Prompt Engineering: The quality of the LLM's output depends heavily on the clarity and precision of the prompt.
* Data Accuracy: Ensure the accuracy and completeness of the input data.
* Ethical Considerations: Address any potential biases or fairness issues in the scheduling process.
* Scalability: Consider the scalability of the solution for large datasets.
* Error Handling: Implement robust error handling to address potential issues during LLM execution.
* Testing: Thoroughly test the system with various scenarios to ensure its reliability.

This protocol provides a comprehensive framework for using an LLM to generate student-teacher schedules. Adapt it to your specific needs and constraints for optimal results.
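
The fallback idea above is straightforward to wire up. Here's a bare-bones sketch in Python; call_llm and the provider names are placeholders standing in for a real client library, not actual APIs.

```python
import time

DEFAULT_RESPONSE = "Could not generate a schedule automatically; routing to a human."

def call_llm(provider, prompt):
    """Placeholder for a real client call; raise or return None on failure."""
    raise NotImplementedError

def generate_schedule(prompt, simpler_prompt):
    # 1. Retry, then switch to a different LLM provider if one fails.
    for provider in ("provider_a", "provider_b"):
        for attempt in range(2):
            try:
                result = call_llm(provider, prompt)
                if result:
                    return result
            except Exception:
                time.sleep(2 ** attempt)  # brief backoff before the next try

    # 2. Fall back to a simpler prompt.
    try:
        result = call_llm("provider_a", simpler_prompt)
        if result:
            return result
    except Exception:
        pass

    # 3. Last resort: predefined default response / hand-off to a human operator.
    return DEFAULT_RESPONSE
```

None of this makes the model smarter, but it keeps a single bad completion from taking the whole workflow down.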

Adam:

Good article, Nick. You've highlighted one of the main challenges for state-of-the-art AI in 2025: can these models become robust reasoners? I wrote about this last year, using a little reasoning test of my own, which I continue to run on the latest reasoning models as they are released. My test is in some ways quite similar to yours: it involves scheduling substitutions and playing time for a 5-a-side football team.

See: https://ai-navigator.medium.com/can-ai-reason-4adf80ebc9b1

And: https://www.youtube.com/watch?v=h_DCekOtFqk

I do expect a lot of progress in AI reasoning in the next couple of years. It will be essential for AI to take the next leap in usefulness.

Nick Potkalitsky:

Awesome!!! I will check this out!!! When I think back to the models we were using two years ago, you have to admit that things are moving quickly.

Daniel Nest:

Solid test, disappointing results for o3-mini.

Saty Chary:

Hi Nick, nice article.

In a sense, it's not surprising: for the past 70 years, i.e., since AI's inception, what has worked is deep/narrow AI (and even that with significant help from humans in the form of rules, data, and goals); what has remained elusive is generalizing across disparate domains.

Nick Potkalitsky:

The double-checking inherent in the drawn-out reasoning process seems not to gel with this type of task. GPT-4 is more of a workhorse.
