Data Scientific Intuition that defines Good vs. Bad scientists

Member for

6 months
Real name
Keith Lee
Position
Professor

Many amateur data scientists have little respect for the math/stat behind computational models
Math/stat contains the modelers' logic and intuition about real-world data
Good data scientists are the ones with excellent intuition

On SIAI's website, we can see that most wannabe students go to the MSc AI/Data Science program intro page and almost never visit the MBA AI program pages. We have a shorter MSc track that requires extensive pre-study, and a much longer version that covers the missing pre-studies. Over 90% of wannabes just take a quick look at the shorter version and walk away. Fewer than 10% go to the longer version, and almost nobody to the AI MBA.

We get that they are 'wannabe' data scientists with passion, motivation, and dreams, confident that they are in the top 1%. But the reality is harsh. So far, fewer than 5% of applicants have been able to pass the admission exam for the longer version of the MSc AI/Data Science. We almost never have applicants who are ready for the shorter one. Most, in fact almost all, students have to compromise their dream and accept reality. The admission exam consists of the first two courses of the AI MBA, our lowest-tier program, and that fact alone brings students to their senses: over half of the applicants usually disappear before or after the exam. Some students choose to retake the exam the following year, but most end up with the same score. Then they either criticize the school in very creative ways or walk away with frustrated faces. I am sorry for keeping the school's integrity so high.

Data Scientific Intuition that matters the most

The school focuses on two things in its education. First, we want students to understand the thought processes of data science modelers. Support Vector Machine (SVM), for example, reflects the idea that a fit generalizes better if the separating hyperplane is bounded by inequalities instead of fixed conditions. If one understands that the hyperplane itself is already a generalization, it becomes much easier to see why SVM was introduced as an alternative to linear-form fitting and which cases it applies to in real-life data science exercises. The very nature of this process is embedded in the school's motto, 'Rerum Cognoscere Causas' (from 'Felix, qui potuit rerum cognoscere causas'), meaning a person pursuing the fundamental causes.
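As a rough illustration of that contrast (the symbols $w$, $b$, $x_i$, $y_i$ are notation assumed here, not taken from the course material), a fixed-condition linear fit ties every observation to the fitted form, while the hard-margin SVM only asks each observation to stay on the correct side of a band around the hyperplane:

```latex
% Fixed (equality) conditions: each observation is tied to the fitted form
y_i = w^\top x_i + b, \qquad i = 1, \dots, n

% SVM with labels y_i \in \{-1, +1\}: inequality conditions only
\min_{w,\, b} \; \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Any hyperplane that satisfies the inequalities is admissible, so the data constrain a region rather than a single line, and the objective then picks the widest margin inside that region.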

The second focus of the school is to help students learn where and how to apply data science tools to solve real-life puzzles. We call this process building data scientific intuition. Often, math equations in textbooks and code lines on one's console screen do not have any meaning unless they are combined to solve a particular problem in a particular context with a specific objective. Contrary to what many amateur data scientists believe, coding libraries have not democratized data science for untrained students. In fact, the code copied by amateurs is evident proof, in the form of rookie failures, that data science tools need much deeper background knowledge in statistics than simple code libraries provide.

Our admission exam is designed to weed out the dreamers and amateurs. After years of trial and error, we decided to give a full lecture course in elementary math/stat to all applicants, so that we not only offer them a fair chance but also give them a warning as realistic as our coursework. Previous schooling elsewhere may help them, but the exam helps us see whether one has the potential to develop 'Rerum Cognoscere Causas' and data scientific intuition.

Intuition does not come from hard study alone

When I first raised my voice about the importance of data scientific intuition, I had severe conflicts with amateur engineers. They thought that copying code lines from a class (or a GitHub page) and applying them elsewhere would make them as good as highly paid data scientists. They thought the work was nothing more than programming for websites, apps, or any other basic programming exercise. These amateurs never understand why, to take just one example, you need a two-stage least squares (2SLS) regression to remove measurement error effects for a particular data set in a specific time range. They just load data from a SQL server, feed it to a code library, and change input variables, time ranges, and computing resources, hoping that one combination out of many will find what their bosses want (or something they can claim was cool). Without understanding the nature of the process behind the data, which we call the 'data generating process' (DGP), their trial and error is nothing more than hunting for higher correlations, as untrained sociologists do in their junk research.

Instead of blaming one code library for performing worse than another, true data scientists look for the embedded DGP and try to build a model that follows intuitive logic. Every step of the model requires concrete arguments reflecting how the data was constructed, and sometimes requires data cleaning by variable restructuring, carving out endogeneity with 2SLS, and/or countless model revisions.
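To make the measurement-error case concrete, here is a minimal sketch on simulated data. Everything in it is an assumption chosen for illustration: the true slope of 2.0, the noise levels, and the use of a second noisy measurement as the instrument. It is not the school's coursework code, only one way to see why plain OLS is attenuated while 2SLS is not.

```python
import random


def ols_slope_intercept(x, y):
    """One-regressor OLS; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def two_stage_least_squares(x, z, y):
    """2SLS with a single regressor x and a single instrument z."""
    # Stage 1: project the noisy regressor on the instrument.
    b1, a1 = ols_slope_intercept(z, x)
    x_hat = [a1 + b1 * zi for zi in z]
    # Stage 2: regress the outcome on the fitted values from stage 1.
    slope, _ = ols_slope_intercept(x_hat, y)
    return slope


def simulate(n=20000, beta=2.0, seed=0):
    """Hypothetical DGP: y depends on x_true, but we only observe noisy copies."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    x_obs = [v + rng.gauss(0, 1) for v in x_true]  # regressor measured with error
    z = [v + rng.gauss(0, 1) for v in x_true]      # independent second measurement
    y = [beta * v + rng.gauss(0, 1) for v in x_true]
    return x_obs, z, y


x_obs, z, y = simulate()
ols_beta, _ = ols_slope_intercept(x_obs, y)
iv_beta = two_stage_least_squares(x_obs, z, y)
print(f"OLS slope:  {ols_beta:.2f} (attenuated toward zero)")
print(f"2SLS slope: {iv_beta:.2f} (simulated true slope is 2.0)")
```

With these settings the OLS slope lands near half the true value, because the measurement noise has the same variance as the signal. Knowing that requires knowing how the data were generated, which no library call reveals.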

Years of teaching have shown us that we can help students memorize all the necessary steps for each textbook case, but not many students are able to extend that understanding to their own research. In fact, the potential is clearly visible in the admission exam or in the early stage of the coursework. Promising students always ask why and what if. Why does SVM's functional form contain $1/C$, which may limit the range of $C$ in one's model, and what if a data set with zero truncation ends up with a separating hyperplane close to 0? Once students can see how to match equations with real cases, they can upgrade imaginative thought processes into model-building logic. As for other students, I am sorry, but I cannot recall a successful student without that ability. High grades in simple memory tests can convince us that they study hard, but lack of intuition makes them no better than a textbook. With this experience, we design all our exams to measure how intuitive students are.
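One standard place where $1/C$ shows up is the soft-margin objective. This is an assumption about which formulation is meant, since the text does not spell it out. Dividing the objective by $C$ leaves the minimizer unchanged and turns $C$ into a regularization weight $1/C$, which only makes sense for $C > 0$:

```latex
\min_{w,\, b,\, \xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0

% Equivalent for C > 0 (same minimizer, objective divided by C):
\min_{w,\, b,\, \xi} \; \tfrac{1}{2C}\lVert w \rVert^2 + \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0
```

Read this way, as $C \to 0^{+}$ the weight $1/C$ grows without bound and pushes $w$ toward zero, which is one route by which a fitted hyperplane can collapse toward 0.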

Intuition that frees a data scientist

In my Machine Learning class on tree models, I always emphasize that a variable with multiple disconnected effective ranges in a tree spans a different space from the one used in linear/non-linear regressions. A variable that is important in a tree space, for example, may not display a strong tendency in linear vector spaces. A drug that is only effective for certain age/gender groups (say 5~15 and 60+ for males, 20~45 for females) is a good example. Linear regression will hardly capture the same effective range. After the class, most students understand that relying on the variable importances of tree models may conflict with p-value-type variable selection in regression-based models. But only students with intuition find a way to combine both models: they find the effective ranges of variables from the tree and redesign the regression model with 0/1 signal variables that separate out the effective range.
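The combination described above can be sketched on simulated data. The age/sex ranges follow the drug example in the text. The effect size of 3.0, the noise level, and the helper names are hypothetical, and the ranges are assumed to have already been read off a fitted tree rather than estimated here.

```python
import random


def in_effective_range(age, sex):
    """0/1 signal variable; ranges assumed to come from a fitted tree's splits."""
    if sex == "M":
        return 1 if (5 <= age <= 15 or age >= 60) else 0
    return 1 if 20 <= age <= 45 else 0


def slope_and_r2(x, y):
    """One-regressor OLS slope and R-squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


rng = random.Random(1)
n = 10000
effect = 3.0  # hypothetical drug effect inside the effective ranges

ages = [rng.uniform(0, 80) for _ in range(n)]
sexes = [rng.choice("MF") for _ in range(n)]
signal = [in_effective_range(a, s) for a, s in zip(ages, sexes)]
response = [effect * d + rng.gauss(0, 1) for d in signal]

# Regression on raw age: the disconnected ranges leave almost no linear trend.
age_slope, age_r2 = slope_and_r2(ages, response)
# Regression on the 0/1 signal variable built from the tree's ranges.
sig_slope, sig_r2 = slope_and_r2(signal, response)

print(f"raw age:      slope={age_slope:.3f}, R^2={age_r2:.3f}")
print(f"signal (0/1): slope={sig_slope:.3f}, R^2={sig_r2:.3f}")
```

The regression on raw age sees almost nothing, while the same regression on the signal variable recovers the simulated effect. The tree supplies the ranges, and the regression supplies an interpretable coefficient.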

This kind of extended thought process is hardly visible among ordinary and disqualified students. Ordinary ones may have the capacity to discern what is good, but they often have a hard time applying new findings to their own work. Disqualified students do not even see why that was a neat trick for better exploiting the DGP.

What's surprising is that previous math/stat education mattered the least. It was more about how logical they are, how hard-working they are, and how intuitive they are. Many students come with the first two, but hardly the third. We help them build the third muscle while strengthening the first. (No one but you can help with the second.)

Students who retake the admission exam end up with the same grades largely because they fail to embody the intuition. It may take years to develop the third muscle. Some students are smart enough to see the value of intuition almost right away. Others may never find it. As much as we feel sorry for the failing students, we think that their undergraduate education did not help them build the muscle, and that they were unable to build it by themselves.

The less challenging tier programs are designed to help the unlucky ones, if they want to make up the missing pieces from their undergraduate coursework. Blue pills only make you live in a fake reality. We just hope our red pill helps you find the bitter but rewarding reality.

MSc AI/Data Science vs. Boot Camp for AI

Member for

5 months 2 weeks
Real name
David O'Neill
Position
Professor
Bio
Founding member of GIAI & SIAI
Professor of Data Science @ SIAI

Boot camp is for software programming without mathematical training
MSc is a track for PhD, with in-depth scientific research written in the language of math and stat
We respect programmers, but our work is significantly different

Because we run SIAI, a higher education institution for AI/Data Science, we often get questions about the difference between Boot Camps for AI and MSc programmes. The shortest answer is the difference in math requirements. The Masters track is for people looking for academic training so that they can read academic papers in the subject. With a PhD in the topic, we expect the student to be able to lead research. From a Boot Camp, sorry to be a little aggressive here, we only expect a 'Coding Monkey'.

We are aware that AI/Data Science is so shallow in many countries that employers only want employees who can make the best use of OpenAI's and AWS's libraries through REST APIs. For that, a boot camp should be enough, unless the boot camp teacher does not know how to do so. There is a nearly infinite amount of content on how to use a REST API in your software, regardless of your backend platform, be it an easy scripting language like Python or a tough functional one like OCaml. The difficulty of a language is not what determines the real challenge, and we, as data scientists at GIAI, care less about what language you use. What's important is how flexible your thinking is in mathematically contained modeling.

Boot camp for software programming, MSc for scientific training

Unfortunately, unless you are lucky enough to be born as smart as Mr. Ramanujan, you cannot learn math modeling skills from a bunch of blogs. Programming, however, has countless proven records of excellent programmers without school training. Elon Musk is just one example. He studied Economics and Physics as an undergraduate at UPenn, and he reportedly left his PhD program at Stanford University within days. Programming is nothing more than logic, but math needs too many building blocks before one can understand the language.

When we first built SIAI, we had quite a lengthy discussion that went on for weeks. Keith was firm that we should stick to the mathematical aspects of AI/Data Science (which doesn't mean we should only teach math, just to avoid any misunderstanding). Mc wanted two-tier tracks for math and coding. We later found that with coding, it is unlikely that we could have the school accredited by official parties, so we ended up with Keith's idea. Besides, we have seen so many Boot Camps around the world that we do not believe we could be competitive in that regard.

The founding motto of the school is 'Rerum Cognoscere Causas', meaning 'to know the real causes of things'. With mathematical tools, we were sure that we could teach the reasons why a computational model was first introduced. Indeed, Keith has done so well in his Scientific Programming course that most students are no longer bound by the media brainwashing that Neural Network is the most superior model.

Scientists do our own stuff

If you just go through Boot Camps for coding, chances are that you can learn the limitations of Neural Networks only through endless trial and error, if not from somebody's Medium posts and Reddit comments. In other words, without proper math training, it is unlikely that one can understand how the computational logic of each model is built, which is why we stay aloof from programmers without the necessary math training.

This very idea comes from multiple rounds of uneasy exposure to software engineers without a shred of understanding of the modeling side of AI. They usually claim that Neural Network is proven to be the best model and that they do not need to know any other model. All they have to do, they say, is run and test it. Researchers at GIAI are trained scientists, and we can mostly guess what will happen just by looking at the equations. And, most importantly, we are well aware that NN is the best model only for certain tasks.

They kept claiming that they were like us, and some of them wanted to build a formal association with SIAI (and later GIAI). It is hard for us to work with them if they keep that attitude. These days, whenever we are approached by third parties who want to be on an equal footing with us, we ask them to show us their level of math training. Please make no mistake: we respect them as software engineers, but we do not respect them as scientists.

I guess the aforementioned story and our internal discomfort tell you the difference between software engineers and data/research scientists, let alone the tools that we rely on.

We screen out students by admission exams in math/stat

With that experience, Keith initiated two admission exams for our MSc AI/Data Science programmes. At the very beginning, we thought there would be plenty of qualifying students, so we used final-year undergraduate materials. It was a disaster. We gave them two months of dedicated training, provided similar exams, and solved each one of them in extra detail. But only 2 out of 30 students were able to get grades good enough to be admitted.

We lowered the level to European 2nd year (perhaps American 3rd year), and the outcome wasn't that different. Students were barely able to grasp even the superficial concepts of key math/stat. This is why we were kind of forced to create an MBA program that covers European 2nd-year teaching materials with an ample number of business application cases. With that, students survive, but their answers in the final exam tell us that many of them belong in coding Boot Camps, not at SIAI.

From 2025 onwards, we will have one admission exam for the MSc AI/Data Science (2 year) in March, after 2 months of pre-training in January and February. The exam materials will be at 2nd-year undergraduate level. If a student passes, we offer an exam one notch up in June, again after 2 months of pre-training in April and May. Passing it gives them admission to the MSc AI/Data Science (1 year).

Students who fail the 2-year track admission are offered admission to the MBA AI program, which covers part of the 2-year track courses. If they think they are ready, they can take the admission exam again the following year. According to our statistics, after a year of various coursework some students have shown better performance, but not by much. It seems as if the brain has a limit that they cannot go above.

For precisely the same reason, we are reasonably sure that not many applicants will be able to make it to the 2-year track, and almost no one to the 1-year track. More details are available from the link below:

Why is STEM so hard? Why the high dropout?

Member for

5 months 2 weeks
Real name
Catherine Maguire
Position
Professor
Bio
Founding member of GIAI
Professor of Data Science @ SIAI

STEM majors are known for high dropouts
Students need to have more information before jumping into STEM
Admission exam and tiered education can work, if designed right

Over years of studying and teaching in the fields of STEM (Science, Technology, Engineering, and Mathematics), we have found it not uncommon to see students disappear from a program. They are often found in a different program, or sometimes they just leave the school. There is no commonly shared dropout rate across countries, universities, and specific STEM disciplines, but there is a general tendency for more difficult course materials to drive more students out. Math and Physics usually lose the most students, and graduate schools lose far more students than undergraduate programs.

At the onset of SIAI, though there were growing concerns that we should set the admission bar high, we came to agree that we should give students a chance. Other universities have somewhat strict quotas assigned to each program because of classroom sizes, the number of professors, and so on. Since we provide everything online, we thought we were limitless, or at least that we could extend the limit.

After years of teaching, we have come to agree that students are rarely ready to study STEM topics. Most students have been exposed to the wrong education in college, or even in high school. We had to brainwash them to find the right track in using math and statistics for scientific studies. Many students are not that determined, either. They give up in the middle of their studies.

With accumulated experience, we can now argue that the high dropout rate in STEM fields can be attributed to a variety of factors, and that it is not solely due to either a high number of unqualified students or the difficulty of the classes. Here are some key factors that can contribute to the high dropout rate in STEM fields:

  1. High Difficulty of Classes: STEM subjects are often challenging and require strong analytical and problem-solving skills. The rigor of STEM coursework can be a significant factor in why some students may struggle or ultimately decide to drop out.
  2. Lack of Preparation: Some students may enter STEM programs without sufficient preparation in foundational subjects like math and science. This lack of preparation can make it difficult for students to keep up with the coursework and may lead to dropout.
  3. Lack of Support: Students in STEM fields may face a lack of support, such as inadequate mentoring, tutoring, or academic advising. Without the necessary support systems in place, students may feel isolated or overwhelmed, contributing to higher dropout rates.
  4. Perceived Lack of Relevance or Interest: Some students may find that the material covered in STEM classes does not align with their interests or career goals. This lack of perceived relevance can lead to disengagement and ultimately dropout.
  5. Diversity and Inclusion Issues: STEM fields have historically struggled with diversity and inclusion. Students from underrepresented groups may face additional barriers, such as lack of role models, stereotype threat, or feelings of isolation, which can contribute to higher dropout rates.
  6. Workload and Stress: The demanding workload and high levels of stress associated with STEM programs can also be factors that lead students to drop out. Balancing coursework, research, and other commitments can be overwhelming for some students.
  7. Career Prospects and Job Satisfaction: Some students may become disillusioned with the career prospects in STEM fields or may find that the actual work does not align with their expectations, leading them to reconsider their career path and potentially drop out.

It's important to note that the reasons for high dropout rates in STEM fields are multifaceted and can vary among individuals and institutions. Addressing these challenges requires a holistic approach that includes providing academic support, fostering a sense of belonging, promoting diversity and inclusion, and helping students explore their interests and career goals within STEM fields.

Not just for the gifted bright kids

Given what we have witnessed so far, at SIAI, we have changed our admission policy quite dramatically. The most important change is that we now have admission exams and courses that prepare for those exams.

Although it sounds a little paradoxical that students come to the program to study for an exam, and not vice versa, we have come to understand that our customized exam greatly helps us find the true potential of each student. The only problem with the admission exam is that it knocks most students out at the front door. We thus offer classes to help students prepare.

This is actually the beauty of online education. We are not bound to a location or a time. Students can go over the prep materials on their own schedule.

So far, we are content with this option for the following reasons:

  1. Self-motivation: The exams are designed so that only dedicated students can pass. They have to do and re-do the earlier exams multiple times, and if they lack self-motivation, they skip the study and fail. Online education unfortunately cannot give you detailed day-to-day mental care. Students have to be mature in this regard.
  2. Measure preparation level: We rarely find students from any major, even STEM majors at top schools, prepared enough to follow the mathematical intuitions thrown at them in class. We designed the admission exam one level below their desired study, so if they fail, they are not even ready for the lower-level studies.
  3. Introduction to challenge: Students are indeed aware of the challenges ahead of them, but their sense of the depth is often shallow. Setting the exam 1~2 courses below the real challenge has so far consistently helped us convince students that they have loads of work to do if they want to survive.

Seldom are there well-prepared students. The gifted ones will likely be rewarded with scholarships and other activities in and around the school. But most other students are not gifted, and that is why there is a school. It is just that, given the high dropout rate in STEM, it is the school's job to give out the right information and pick the right students.

Is an online degree inferior to an offline degree?

Member for

6 months
Real name
Keith Lee
Position
Professor

Not the quality of teaching, but the way it operates
Easier admission and graduation bar applied to online degrees
Studies show that higher quality attracts more passion from students

Although much of the prejudice against online education disappeared during the COVID-19 period, there is still a strong prejudice that online education is of lower quality than offline education. This is what I feel while actually teaching: there is no significant difference in the content of the lecture itself between recording a video lecture and lecturing in a classroom, but there is a gap in communication with students, and unless a new video is recorded every time, it is difficult to revise past content. That can be a problem.

On the other hand, I often hear that having videos is much better because students can listen to the lecture content repeatedly. Since the course I teach is an artificial intelligence course based on mathematics and statistics, I hear that students who have forgotten or do not know the mathematical terminology and statistical theory often play the video several times and look up related concepts in textbooks or through Google searches. There is a strong prejudice that the level of online education is lower, but because online lectures can be replayed, advanced concepts can actually be taught more confidently in class, which can be seen as an advantage.

Is online inferior to offline?

While running a degree program online, I have wondered why there is a general prejudice about a gap between offline and online. The conclusion I have reached from experience so far is that although the lecture content is the same, the way the programs are operated is different. How exactly is it different?

The biggest difference is that, unlike offline universities, universities that run online degree programs do not establish a fiercely competitive system and often leave the door to admission wide open. Online education is perceived as a supplement to a degree course, or as a way to fill required credits, and it is extremely rare to find an online degree course so difficult that it is perceived as a demanding professional degree.

Another difference lies in the interactions between professors and students, and among students. When pursuing a graduate degree in a major overseas city such as London or Boston, having to spend a lot of time and money to stay there was a disadvantage, but the bond and intimacy with fellow students built up very densely during the degree program. Such intimacy goes beyond simply knowing faces and becoming friends on social media, because it comes from the common experience of sharing test questions and difficult content during the degree, and of resolving frustrating issues while writing a thesis. This may be why people have come to think that offline education is more valuable.

The domestic Open University and major overseas online universities are also trying to solve the problem of bonding and intimacy by creating common points of contact between students, for example by holding exams on-site instead of online or by arranging study groups among students. It takes a lot of effort.

The final conclusion I came to after looking at these cases was that the difficulty of admission, the difficulty of the learning content, the effort needed to follow the learning progress, and a similar level of understanding among current students have not been found in online universities so far, and that this is what distinguishes offline from online universities.

Would making up for the gap with an online degree make a difference?

First of all, I raised the level of education to a level not found in domestic universities. Most of the lecture content was based on what I had been taught at prestigious global universities and what friends around me had been taught, and the exam questions were raised to a level that even students at prestigious global universities would find challenging. There were many cases where students from prestigious domestic universities, and holders of master's or doctoral degrees from domestic universities, assumed it was a light degree because it was an online university, and then ran away in shock. There was even a community post asking if . Once it became known that it was an online university, there was quite a stir in the English-speaking community.

I have definitely experienced that if you raise the difficulty level of the education, the tendency to take a program lightly because it is online largely disappears. So, can there be a significant difference between online and offline in terms of student achievement?

Source=Swiss Institute of Artificial Intelligence

The table above is an excerpt from a study conducted to determine whether the test score gap between students who took classes online and students who took classes offline was significant. In the case of our school, we have never run offline lectures, but a similar conclusion can be drawn from the difference in grades for students who frequently visited in person and asked many questions.

First, in the (1) OLS analysis above, we can see that students who took online classes received grades about 4.91 points lower than students who took offline classes. Various conditions should be taken into consideration: the students' levels may differ, some students may not have studied hard, and so on. Because this simple analysis takes none of them into account, its accuracy is very low. In fact, if students who only take classes online stay away from school out of laziness, their lack of passion for learning may be directly reflected in their test scores, and this estimate does not properly account for that.

To solve this problem, in (2) IV, the distance between the offline classroom and the student's residence was used as an instrumental variable that can eliminate the external factor of students' laziness. This is because the closer the distance, the easier it is to take offline classes. Even with external factors removed through this variable, the test scores of online students were still 2.08 points lower. Looking at this alone, one would conclude that online education lowers students' academic achievement.
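The logic of moving from (1) to (2) can be illustrated with a small simulation. This is a sketch on synthetic data, not the study's data. The variable names, the coefficients, and the choice to set the true online effect to zero are all assumptions made to isolate the mechanism: unobserved passion drives both the choice to study online and the score, so the naive gap is biased, while randomly varying distance works as an instrument.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


rng = random.Random(42)
n = 40000
true_effect = 0.0  # assumption: online study has no causal effect in this simulation

# Unobserved passion for learning (the confounder).
passion = [rng.gauss(0, 1) for _ in range(n)]
# Distance from home to the classroom, unrelated to passion (the instrument).
distance = [rng.uniform(0, 10) for _ in range(n)]
# Students far away, and less passionate students, are more likely to go online.
online = [
    1 if 0.4 * (d - 5) - p + rng.gauss(0, 1) > 0 else 0
    for d, p in zip(distance, passion)
]
# Score depends on passion and (by assumption, not at all) on studying online.
score = [
    70 + 5 * p + true_effect * o + rng.gauss(0, 5)
    for p, o in zip(passion, online)
]

# (1)-style naive OLS of score on the online dummy.
naive = cov(online, score) / cov(online, online)
# (2)-style IV estimate using distance as the instrument.
iv = cov(distance, score) / cov(distance, online)

print(f"naive OLS gap: {naive:.2f}")
print(f"IV estimate:   {iv:.2f} (simulated true effect is {true_effect})")
```

In this setup the naive gap is clearly negative even though online study does nothing to scores, and the IV estimate moves back toward the simulated truth. Whether a real instrument is valid still depends on arguments about the DGP, such as whether distance is truly unrelated to passion.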

However, a question arose as to whether students' passion for studying could be captured by something beyond simple distance. While looking at various variables, I thought that the number of library visits could be an appropriate indicator of passion, since passionate students are expected to visit the library more actively. The revised calculation in (3) IV showed that students who diligently attended the library received scores 0.91 points higher, and the decline in scores due to online education was reduced to only 0.56 points.

Another question that arises here is how close the library is to the students' residences. Just as the proximity to an offline classroom was used as a major variable, the proximity of the library is likely to have had an effect on the number of library visits.

So, in (4) IV, for students who were assigned a dormitory by random drawing, we first confirmed, by analyzing the correlation between distance from the classroom and test scores, that distance did not have a direct effect on test scores. We then measured the frequency of library visits among students in that group and recalculated the gap in test scores due to taking online courses.

As shown in (5) IV, with the effect of distance completely removed, visiting the library increased the test score by 2.09 points, and taking online courses actually increased the test score by 6.09 points.

As can be seen in the above example, the basic simple analysis in (1) leads to the misleading conclusion that online lectures reduce students' academic achievement, while the calculation in (5), after readjusting the problem between variables, shows the opposite: students who listened carefully to online lectures achieved higher achievement levels.

This is consistent with actual educational experience: students who do not watch a video lecture just once, but replay it and continuously look up related materials, achieve more. In particular, students who repeated sections and paused dozens of times during playback performed more than 1% better than students who mainly skipped quickly through the lecture. After removing the effects of variables such as whether the student was in a study group, the average score of fellow students in the study group, the score distribution, and academic background before entering the degree program, the difference associated with video viewing patterns was not simply a gap of 5 or 20 points, but a difference large enough to determine pass or fail.

Not because it is online, but because of differences in students’ attitudes and school management

The conclusion that can be confidently drawn from actual data and various studies is that there is no platform-based reason why online education should be valued less than offline education. The difference arises because universities operate online education courses like lifelong education centers to make additional money, and because online education has been run so lightly for the past several decades, students approach it with prejudice.

In fact, once we provided high-quality education and organized the program so that students would naturally fail if they did not study passionately, the gap with offline programs was greatly reduced, and the student's own passion emerged as the most important factor in determining academic achievement.

Nevertheless, completely non-face-to-face education does not greatly help build the bond between professors and students, and it makes it difficult for professors to predict students' academic achievement because they cannot make eye contact with individual students. Asian students in particular rarely ask questions, and in my experience it is not easy to gauge whether students are really following along when there are no questions.

A supplementary system would likely include periodic quizzes and careful grading of assignments, and if the online lecture is held live, calling on students by name and asking them questions would also be a good idea.
