I remember how I felt when I tried o3 after learning that it had "achieved an ELO score of 2727 on Codeforces, a competitive programming platform, surpassing OpenAI's Chief Scientist's score of 2665", only to watch it fail miserably on a pretty dumb Google Apps Script project I had in mind...
I recently saw a recorded webinar my supervisors sent me. It was meant for administrators, and the instructor was making very bold claims, such as "it's great for scheduling", and I totally called BS on that. Thx for confirming, I didn't even try.
ACTUALLY - I got access to Manus.im (and paid after I quickly ran out of free tokens), and I tried a scheduling task. I asked it to schedule 20 students for one-time presentations to 10 teachers within 2 weeks, with each presentation occurring during teacher prep periods and no teacher hosting more than 3 presentations total. I gave it the class list as a CSV and the teacher prep schedule as a CSV: each teacher has 2 prep periods a day, but all teachers have different prep times.
Not only did Manus nail it (after a first glitch), it built me a website on the spot where I can show the results per day, per teacher, etc. Check it out: https://aemnybne.manus.space/
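For anyone who wants to sanity-check a generated schedule like this, here is a rough Python sketch of a validator for those constraints. The CSV column names (student, teacher, day, period) are assumptions for illustration, not the actual files from the experiment.

```python
import csv
from collections import Counter, defaultdict

MAX_PER_TEACHER = 3  # no teacher hosts more than 3 presentations

def load_prep_slots(prep_csv):
    """Build {teacher: set of (day, period)} from the prep-schedule CSV."""
    slots = defaultdict(set)
    with open(prep_csv, newline="") as f:
        for row in csv.DictReader(f):
            slots[row["teacher"]].add((row["day"], row["period"]))
    return slots

def validate_schedule(schedule_csv, prep_csv, students):
    """Return a list of constraint violations (an empty list means the schedule checks out)."""
    prep = load_prep_slots(prep_csv)
    seen = Counter()          # presentations per student
    per_teacher = Counter()   # presentations per teacher
    errors = []
    with open(schedule_csv, newline="") as f:
        for row in csv.DictReader(f):
            s, t = row["student"], row["teacher"]
            seen[s] += 1
            per_teacher[t] += 1
            if (row["day"], row["period"]) not in prep.get(t, set()):
                errors.append(f"{s} -> {t} falls outside {t}'s prep periods")
    errors += [f"{s} is scheduled {n} times" for s, n in seen.items() if n != 1]
    errors += [f"{s} was never scheduled" for s in students if s not in seen]
    errors += [f"{t} hosts {n} presentations (max {MAX_PER_TEACHER})"
               for t, n in per_teacher.items() if n > MAX_PER_TEACHER]
    return errors
```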
Good to know. I just got access last week but haven't played around with it yet. The more I experiment, the more I think it comes down to how the materials are formatted and how the task is prompted. The capability seems to be latent in this generation, but different models require slightly different activations on the user's end.
It's really about writing and executing code so that the model can compute your task instead of relying on LLM inference alone. These new tools essentially design for themselves the tools they need to meet your demand.
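As a loose illustration of that "write code, then run it" pattern, here is a sketch in Python; llm_generate_code is a hypothetical placeholder for whichever model API you use, not a real function.

```python
import subprocess
import tempfile

def llm_generate_code(task_description: str) -> str:
    """Hypothetical placeholder: ask your model to emit a self-contained
    Python script that solves the task and prints the answer."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

def solve_with_generated_code(task_description: str) -> str:
    # 1. Have the model write a program instead of answering directly.
    code = llm_generate_code(task_description)
    # 2. Run that program so the answer comes from actual computation,
    #    not token-by-token inference. In practice this step should be
    #    sandboxed; executing untrusted generated code as-is is unsafe.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name
    result = subprocess.run(
        ["python", script_path], capture_output=True, text=True, timeout=60
    )
    return result.stdout
```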
Hi Nick, great article and experiment. I wonder what would have happened if you had trained the model with several examples (RAG, for instance). I fully agree with your observations, but I have managed to get models to produce better solutions by teaching them that simple questions give simple answers, which looks good or impressive to outsiders. If it looks like shit, it is shit (Boogie Nights). Give it context and you will be surprised 😉
Thanks. I just reworked my prompting pathway and took things a little slower. I asked GPT-4 this time to teach me how to prompt it for success. Results were much, much better!!!!
Hi Nick, very good article. The complexity of scheduling problems rises very quickly with the number of variables and constraints applied. In these cases, the effort to solve them is much higher than for straightforward organizational tasks. Another issue is that LLMs don't behave like conventional functions: they will sometimes provide different answers for the same input prompt and parameters. Here is a link that may help future efforts: https://timefold.ai/
Very cool, Tales. I will check this out. No, I was hoping to tap into a little of the unconventionality in light of the complexity of the task at hand. I tried with a much more clearly delineated prompt cycle and had better results.
I wonder if you'd have gotten better results using Deep Research, Nick? I've found it does great things with uploaded data that I would otherwise have used the o-series models for. I can't guarantee you'd get a better result, but it's definitely worth trying.
Good advice. I was hoping someone would help me troubleshoot.
I actually just downgraded to GPT-4. Had better results. Lol!!!
For how good it is at many things, it's terrible at many others. Which is fine! Because we haven't even figured out what to do with what we have!
From Google:
2. Fallback Mechanisms:
* Purpose: If an LLM call fails or encounters an issue, have alternative paths or actions in place.
* Examples:
  * Retry with a different LLM provider: If one LLM fails, try another.
  * Use a simpler prompt: If the original prompt is too complex, simplify it.
  * Route the request to a human: If the LLM cannot handle the task, delegate it to a human operator.
  * Provide a default response: If the LLM fails to generate a response, provide a predefined default response.
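As a rough sketch of what that fallback chain can look like in code (Python here; call_primary, call_backup, and notify_human are hypothetical stand-ins for your own provider clients and escalation path, not real APIs):

```python
# Hypothetical provider clients; wire these to real SDK calls for your providers.
def call_primary(prompt: str) -> str:
    raise RuntimeError("primary provider unavailable in this sketch")

def call_backup(prompt: str) -> str:
    raise RuntimeError("backup provider unavailable in this sketch")

def notify_human(prompt: str) -> None:
    print("Escalating to a human operator:", prompt[:80])

DEFAULT_RESPONSE = "We could not generate a schedule automatically; a person will follow up."

def generate_schedule(prompt: str, simple_prompt: str) -> str:
    """Walk the fallback chain described above, most capable option first."""
    attempts = (
        lambda: call_primary(prompt),        # original prompt on provider A
        lambda: call_backup(prompt),         # retry with a different LLM provider
        lambda: call_backup(simple_prompt),  # use a simpler prompt
    )
    for attempt in attempts:
        try:
            return attempt()
        except Exception:
            continue
    notify_human(prompt)       # route the request to a human...
    return DEFAULT_RESPONSE    # ...and hand back a predefined default response
```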
From Gemini: Implementation
* Optimization:
  * If the LLM's solution is not optimal, consider using optimization algorithms (e.g., genetic algorithms, constraint programming) to further refine the schedule (see the sketch after this protocol).
  * These algorithms can be integrated with the LLM's output to improve efficiency and fairness.
* Implementation:
  * Integrate the generated schedule into a scheduling system or database.
  * Communicate the schedule to students and teachers.
  * Create a system for feedback and change requests.
Important Considerations:
* LLM Capabilities: LLMs are powerful, but they may not always produce perfectly optimal solutions. Complex scheduling problems may require additional optimization techniques.
* Prompt Engineering: The quality of the LLM's output depends heavily on the clarity and precision of the prompt.
* Data Accuracy: Ensure the accuracy and completeness of the input data.
* Ethical Considerations: Address any potential biases or fairness issues in the scheduling process.
* Scalability: Consider the scalability of the solution for large datasets.
* Error Handling: Implement robust error handling to address potential issues during LLM execution.
* Testing: Thoroughly test the system with various scenarios to ensure its reliability.
This protocol provides a comprehensive framework for using an LLM to generate student-teacher schedules. Adapt it to your specific needs and constraints for optimal results.
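For the constraint-programming option mentioned under Optimization above, here is a minimal sketch using Google OR-Tools' CP-SAT solver on a toy version of the presentation-scheduling problem from earlier in the thread. The teacher names, slot encoding, and data sizes are made up for illustration; only the one-presentation-per-student and three-per-teacher rules come from the task as described.

```python
# pip install ortools
from ortools.sat.python import cp_model

# Toy inputs: each teacher's prep slots as (day, period) tuples.
prep_slots = {
    "Ms. Ada":   [(1, 2), (1, 5), (2, 2), (2, 5)],
    "Mr. Boole": [(1, 3), (1, 6), (2, 3), (2, 6)],
}
students = ["S1", "S2", "S3", "S4", "S5"]
MAX_PER_TEACHER = 3

model = cp_model.CpModel()

# x[s, t, slot] = 1 if student s presents to teacher t in that prep slot.
# Variables only exist for valid prep slots, so the "during prep periods"
# constraint holds by construction.
x = {}
for s in students:
    for t, slots in prep_slots.items():
        for slot in slots:
            x[s, t, slot] = model.NewBoolVar(f"{s}_{t}_{slot}")

# Each student presents exactly once.
for s in students:
    model.Add(sum(x[s, t, slot] for t, slots in prep_slots.items() for slot in slots) == 1)

# No teacher hosts more than MAX_PER_TEACHER presentations in total.
for t, slots in prep_slots.items():
    model.Add(sum(x[s, t, slot] for s in students for slot in slots) <= MAX_PER_TEACHER)

# At most one presentation per prep slot.
for t, slots in prep_slots.items():
    for slot in slots:
        model.Add(sum(x[s, t, slot] for s in students) <= 1)

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for (s, t, slot), var in x.items():
        if solver.BooleanValue(var):
            print(f"{s} presents to {t} on day {slot[0]}, period {slot[1]}")
else:
    print("No feasible schedule under these constraints.")
```

In practice you would load the real class list and prep schedule from the CSVs and keep the solver, rather than the LLM, as the final arbiter of the hard constraints.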
Good article, Nick. You've highlighted one of the main challenges for state-of-the-art AI in 2025 - can they become robust reasoners? I wrote about this last year, using a little reasoning test of my own, which I continue to use on the latest reasoning models as they are released. My test is in some ways quite similar to yours: it involves scheduling substitutions and playing time for a 5-a-side football team.
See: https://ai-navigator.medium.com/can-ai-reason-4adf80ebc9b1
And: https://www.youtube.com/watch?v=h_DCekOtFqk
I do expect a lot of progress in AI reasoning in the next couple of years. It will be essential for AI to take the next leap in usefulness.
Awesome!!! I will check this out!!! When I think back to the models we were using 2 years ago, you have to admit that things are moving quickly.
Solid test, disappointing results for o3-mini.
Hi Nick, nice article.
In a sense, this is not surprising: for the past 70 years, i.e. since AI's inception, what has worked is deep/narrow AI (and even that with significant help from humans in the form of rules, data, and goals); what has been elusive is generalizing across disparate domains.
The double-checking inherent in the drawn-out reasoning process doesn't seem to gel with this type of task. GPT-4 is more of a workhorse.