
AI Pessimism, just another correction of exorbitant optimism


By Ethan McGowan, Professor of Data Science @ SIAI; Founding member of GIAI & SIAI

The tone of AI talk has turned: it is becoming more pessimistic
It is just another correction of exorbitant optimism and a realisation of AI's current capabilities
AI can only help us replace jobs that run on low-noise data
Jobs that require finding new patterns in high-noise data, which are mostly better paid, will not be replaceable by current AI

There have been pessimistic talks about the future of AI recently that have created sudden drops in BigTech firms' stock prices. All of a sudden, pessimistic takes from investors, experts, and academics at reputed institutions are being re-visited and re-evaluated. They claim that the ROI (Return on Investment) of AI is too low, AI products are too over-priced, and the economic impact of AI is minimal. In fact, many of us have raised our voices for years with exactly the same warnings. 'AI is not a magic wand'. 'It is just correlation, not causality / intelligence'. 'Don't be overly enthusiastic about what a simple automation algorithm can do'.

As an institution with AI in our name, we often receive emails from a bunch of 'dreamers' wondering if we can build a predictive algorithm that foretells stock price movements with 99.99% accuracy. If we could do that, why do you think we would share the algorithm with you? We should probably keep it secret and make billions of dollars just for ourselves. As the famous expression popularised by Milton Friedman, a Nobel economist, goes, there is no such thing as a free lunch. If we had perfect predictability and it were widely public, then the prediction would no longer be a prediction. If everyone knows that stock A's price will go up, then everyone buys stock A until it reaches the predicted value. Knowing that, the price jumps to the predicted value almost instantly. In other words, the future becomes today, and no one benefits.

AI = God? AI = A machine for pattern matching

A lot of enthusiasts hold the exorbitant optimism that AI can overwhelm human cognitive capacity and soon become a god-like entity. Well, the current forms of AI, be they Machine Learning, Deep Learning, or Generative AI, are no more than machines for pattern matching. You touch a hot pot, you get a burn. It is a painful experience, but you learn that you should not touch the pot when it is hot. The worse the pain, the more careful you become. Hopefully the damage to your skin is not irrecoverable. The exact same pattern works for what they call AI. If you apply the learning processes dynamically, that is where Generative AI comes in. The system is constantly adding more patterns to its database.

Though the extensive size of that pattern base does have great potential, it does not mean that the machine has the cognitive capacity to understand a pattern's causality and/or to find new breakthrough patterns from the list of patterns in the database. As long as it is nothing more than a pattern matching system, it never will.

To give you an example: can it be used to predict what words you are expected to say in a class that has been repeated a thousand times? Definitely. Then, can you use the same machine to predict stock prices? Hasn't the stock market been repeating the same behavior for over a century? Well, unfortunately it has not, thus you cannot profit from the same machine in financial investments.

Two types of data - Low noise vs. High noise

On and near Wall Street, you can sometimes meet an excessively confident hedge fund manager claiming near-perfect foresight of financial market movements. Some of them have outstanding track records and are surprisingly persuasive. In the New York Times archives from the 1940s, or even as early as the 1910s, you can see that people with similar claims were eventually sued by investors, arrested for false claims, and/or simply disappeared from the street within a few years. If they were that good, why did they lose money and get sued or arrested?

There are two types of data. Data that comes from machines (or a highly controlled environment) is called 'low-noise' data. It has high predictability. Even in cases where the embedded patterns are invisible to the naked eye, you need either a more analytic brain or a machine to test all possibilities within the feasible sets. For the game of Go, the brain was Se-dol Lee and the machine was AlphaGo. The game requires searching a 19x19 board over roughly 300 possible moves. Even if your brain is not as good as Se-dol Lee's, as long as your computer can find the winning patterns, you can win. This is what has been witnessed.

The other type of data comes from a largely uncontrolled environment. There potentially is a pattern, but it is not the single impetus that drives every motion of the system. There are thousands, if not millions, of patterns whose drivers are not observable. This is where randomness is needed for modeling, and it is unfortunately impossible to predict the exact move, because the drivers are not observable. We call this type of data 'high-noise'. The stock market is the very example. There are millions of unknown, unpredictable, and at the very least unmeasurable influences that prevent any analyst or machine from predicting with accuracy anywhere near 100%. This is why financial models are not researched for predictability but used only to backtest financial derivatives for reasonable pricing.

Natural language processing (NLP) is one example of low noise. Our language follows a certain set of rules (or patterns), which we call grammar. Unless one is uneducated, intentionally breaking grammar, or simply making mistakes, people generally follow it. Weather is mostly low noise, but it has high-noise components: typhoons are sometimes unpredictable, or at least less predictable. The stock market? Be my guest. As of 2023, four Nobel Prizes have been awarded to financial economists, and all of them are based on the belief that stock markets follow random processes, be they Gaussian, Poisson, or any other unknown random distribution. (Just in case: if a process follows any known distribution, that means it is probabilistic, which means it is random.)
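To make the contrast concrete, here is a minimal simulated sketch (ours, not from the original article): the same lag-based linear model is fit to a strongly patterned series and to the returns of a random walk, and only the former is predictable out of sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Low-noise: a repeating pattern dominates a small random disturbance.
t = np.arange(n)
low_noise = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=n)

# High-noise: i.i.d. shocks, i.e., the returns of a random walk.
high_noise = rng.normal(size=n)

def out_of_sample_r2(series, n_lags=5, train_frac=0.7):
    """Predict x_t from its previous n_lags values; report R^2 on held-out data."""
    X = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
    X = np.column_stack([np.ones(len(X)), X])        # add an intercept
    y = series[n_lags:]
    split = int(train_frac * len(y))
    beta = np.linalg.lstsq(X[:split], y[:split], rcond=None)[0]
    resid = y[split:] - X[split:] @ beta
    return 1 - resid.var() / y[split:].var()

print(f"low-noise  R^2: {out_of_sample_r2(low_noise):.2f}")   # close to 1: patterns repeat
print(f"high-noise R^2: {out_of_sample_r2(high_noise):.2f}")  # around 0: nothing to match
```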

Pessimism / Photo by Mizuno K

Potential benefits of AI

We as an institution hardly believe that current forms of AI will make any significant changes to businesses and our lives in the short term. The best we can expect is the automation of mundane tasks, much like the laundry machine in the early 20th century. ChatGPT has already shown us a path. Soon, customer service operators will largely be replaced by LLM-based chatbots. US companies actively outsourced the function to India over the past few decades, thanks to cheaper international connectivity via the internet. The function will remain, but human action will be needed far less than before. In fact, we already get machine-generated answers from a number of international services. If we complain about a malfunction in a WordPress plugin, for instance, established services email us machine answers first. In a few cases, that actually is enough. The practice will spread to less-established services as it becomes easier and cheaper to implement.

Teamed up with EduTimes, we are also working on research to replace 'copy boys/girls'. The journalists we know at large news magazines are not always running around the streets to find new and fascinating stories. In fact, most of them read other newspapers and rewrite the content as if they were the original sources. Although it is not an important job, it is still needed for the newspaper to run; someone has to keep up with current events, according to the EduTimes journalists who came from other renowned newspapers. The copy team is usually paid the least, and the posting is seen as a death sentence for a journalist. What makes the job even more pitiable, on top of the minimal respect it earns, is that it will soon be replaced by LLM-based copywriters.

In fact, any job that generates patterned content without much cognitive function will gradually be replaced.

What about autonomous driving? Is it a low-noise pattern job or a high-noise, complicated cognitive job? Well, although Elon Musk claims a high possibility of Level 4 self-driving within the next few years, we don't believe so. None of us at GIAI has seen game theorists solve a multi-agent ($n>2$) Bayesian belief game with imperfect information and unknown agent types by computer, which is what an autonomous driving algorithm would need in order to predict what other drivers on the road will do. Without correctly predicting others in fast-moving vehicles, it is hard to tell whether your AI will help you successfully avoid the crazy drivers around you. Driving through those eventful cases needs 'instinct', which requires a bodily function different from cognitive intelligence. The best the current algorithms can do is tighten things up to perfection for a single car, which already has to overcome a lot of mathematical, mechanical, organisational, legal, and commercial (and many more) challenges.

Don't they know all that? Aren't Wall Street investors self-confident, egocentric, but ultra-smart people who already know all the limitations of AI? We believe so. At least we hope so. Then why do they pay attention to the discontented pessimism now, and create heavy drops in tech stock prices?

Our guess is that Wall Street hates to see Silicon Valley paid too much. The American East often thinks the West too unrealistic, with its head in the clouds. OpenAI's next funding round may surprise us in the totally opposite direction.


Why Companies cannot keep the top-tier data scientists / Research Scientists?

By Keith Lee, Professor @ SIAI

Top brains in AI/Data Science are drawn to challenging jobs like modeling
Second-tier companies, with their countless malpractices, can seldom meet those expectations
Even with $$$, they are soon forced out of the AI game

A few years ago, a large Asian conglomerate acquired a Silicon Valley start-up fresh off an early Series A round. Let's call it start-up $\alpha$. The M&A team leader later told me that the acquisition was mostly about hiring the data scientist at the early-stage start-up, but the guy left $\alpha$ on the day the deal was announced.

I had an occasion to sit down with that data scientist a few months later and asked him why. He tried to avoid the conversation, but it was clear that the changing circumstances were definitely not what he had expected. Unlike the bunch of junior data scientists at Silicon Valley's large firms, he signalled his grad school training in math and stats, and I had a pleasant half-hour talk with him about models. He had been mistreated at large firms, assigned to run SQL queries and build Tableau-based graphs like the other juniors. His PhD training was useless there, so he had decided to become a founding member of $\alpha$, where he could build models and test them with live data. The Asian acquirer, with its bureaucratic HR system, wanted him to give up that agenda and transplant the Silicon Valley large firm's junior data scientist training system into the acquiring firm.

Photo by Vie Studio

Brains go for brains

Given the tons of other available positions, he didn't waste his time. Personally, I too have lost some months of my life to mere SQL queries and fancy graphs. Well, some people may still go for the 'data scientist' title, but I am my own man. So was the data scientist from $\alpha$.

These days, Silicon Valley firms call the modelers 'research scientists', or similar names. There are also positions called 'machine learning engineers', whose jobs are somewhat related to 'research scientists' but may exclude the mathematical modeling parts and involve far more software engineering. The title 'data scientist' is now given to jobs that used to be called 'SQL monkey'. As the old nickname suggests, not that many trained scientists would love to do that job, even with a competitive salary package.

What companies have to understand is that we, research scientists, are not trained for SQL and Tableau, but for mathematical modeling. It's like asking a hard-trained sushi chef (将太の寿司, Shota no Sushi) to make street food like Chinese noodles.

Let me give you an example from the real corporate world. Say a semiconductor company $\beta$ wants to build a test model for a wafer / substrate. What I often hear from such companies is that they build a CNN model that reads the wafer's image and matches it against pre-labeled 0/1 flags for error detection. Similar practices have been widely adopted among Neural Network maniacs. I am not saying it does not work. It works. But then, what would you do if the pre-labeling was done poorly? Say the 0/1 entries number over 10,000 and hardly anybody double-checked their accuracy. Can you rely on that CNN-based model? On top of that, the model probably requires an enormous amount of computation to build, let alone to test and operate daily.

Wrong practices that drive out brains

Instead of that costly and less scientific option, we can always build a model that captures the data generating process (DGP). The wafer is composed of $n \times k$ entries, and issues emerge when $n \times 1$ or $1 \times k$ entries go wrong altogether. Given that domain knowledge, one can build a model with cross-products between entries in the same row/column. If a row or column is continuously 1 (assume 1 marks an error), it can easily be identified as a defect case, as in the sketch below.
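Here is a minimal sketch of that row/column idea as we read it (our illustration, with hypothetical names and a toy wafer, not the author's production model):

```python
import numpy as np

def detect_line_defects(wafer: np.ndarray, threshold: float = 1.0):
    """Flag rows/columns whose error rate reaches `threshold`.
    With threshold=1.0 this equals the cross-product test: the product of
    binary entries in a line is 1 only when every entry is an error."""
    bad_rows = np.where(wafer.mean(axis=1) >= threshold)[0]
    bad_cols = np.where(wafer.mean(axis=0) >= threshold)[0]
    return bad_rows, bad_cols

# Toy 6x8 wafer: one systematic row defect and one systematic column defect.
wafer = np.zeros((6, 8), dtype=int)
wafer[2, :] = 1
wafer[:, 5] = 1
print(detect_line_defects(wafer))   # (array([2]), array([5]))
```

Lowering `threshold` below 1.0 would tolerate a few mislabeled entries per line, which is exactly the kind of cheap robustness tweak the CNN pipeline cannot offer at this cost.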

The cost of building a model like that? It just needs your brain. There is a good chance you don't even need a dedicated graphics card for the calculation. Maintenance costs are also incomparably smaller than for the CNN version. The concept of computational cost is something you were supposed to learn in any scientific programming class at school.

At companies sticking to the expensive CNN option, I can always spot the following:

  • The management has little to no sense of 'computational cost'
  • The management cannot discern 'research scientists' from 'machine learning engineers'
  • The company is full of engineers without a sense of mathematical modeling

If you want to grow as a 'research scientist', just like the guy at $\alpha$, then run. If you are smart enough, you must already have run, like the guy at $\alpha$. After all, this is why many second-tier firms end up with CNN maniacs, like $\beta$. Most second-tier firms are unlucky in that they cannot keep research scientists, due to a lack of knowledge and experience. Those companies have to spend years of time and millions of wasted dollars to find out that they were so wrong. By the time they come to their senses, it is mostly already way too late. If you are good enough, don't waste your time on a sinking ship. The management needs a cold-turkey style shock treatment as a solution. In fact, there was a start-up where I stayed only for a week, which lost at least one data scientist every week. The company went bankrupt within two years.

What to do and not to do

At SIAI, I place Scientific Programming right after elementary math/stat training. Students see that each calculation method is an invention to overcome the limitations of earlier options, but that the modification simultaneously bounds the new tactic in other directions. Neural Networks are just one of many kinds. Even after that eye-opening experience, some students remain NN maniacs, and they flunk the Machine Learning and Deep Learning classes. Those students believe there must exist a grand model that is universally superior to all others. I wish the world were that simple, but my ML and DL courses break that very belief. Those who awaken usually become excellent data/research scientists. Many of them come back to tell me that they were able to cut computational costs by 90% just by replacing blindly implemented Neural Network models.

Once they see that dramatic cost reduction, at least some people understand that the earlier practice was wrong. A smart employee will not be happy to suffer poor management and NN maniacs for long. As the guy at $\alpha$ showed, it is always easier to change your job than to fight to change your incapable management. Managers who move fast may be able to retain the smart ones. If not, you end up just like $\beta$: you invest a big chunk of money in an M&A just to hire a smart one, and the smart one disappears.

So, you want to keep the smart ones? The solution is dead simple. Test math/stat training levels in scientific programming. You will save tons of $$$ on graphics card purchases.


Following AI hype vs. Studying AI/Data Science

By Keith Lee, Professor @ SIAI

People following AI hype are mostly completely misinformed
AI/Data Science is still limited to statistical methods
Hype can only attract ignorance

As a professor of AI/Data Science, I from time to time receive emails from hyped followers claiming that what they call 'recent AI' can solve things I have been pessimistic about. They usually think 'recent AI' is close to 'Artificial General Intelligence', meaning the program learns by itself and goes beyond human intelligence.

At the early stage of my start-up business, I answered them with quality content. Soon, I realized that they just want to hear what they want to hear, and criticize people who say what they don't want to hear. Last week, I came across a Scientific American article about the history of automatons that were actually people. (Is There a Human Hiding behind That Robot or AI? A Brief History of Automatons That Were Actually People | Scientific American)

Source: X (Twitter)

AI hype followers' ungrounded dreams of AGI

No doubt many current AI tools are far more advanced than the medieval 'machines' discussed in the Scientific American article, but human-made AI tools are still limited to finding patterns and abstracting them by extracting their common parts. The process requires implementing a logic, whether found by a human or by human-programmed code, and unfortunately the machine code we rely on is still limited to statistical approaches.

AI hype followers claim that recent AI tools have already overcome the need for human intervention. The truth is, even Amazon's AI checkout, which was claimed to need no human cashier, was found to run on a large number of human inspectors, according to the aforementioned Scientific American article.

As far as I know, 9 out of 10, in fact 99 out of 100, research papers in second-tier (or below) AI academic journals are full of regenerations of a few leading papers on different data sets with only minor changes.

The leading papers in AI, like those in all other fields, adapt computational methodologies to fit new data sets and different purposes, but their techniques are unique and help resolve a lot of unsolved issues. Going down to the second tier or below, it is just regeneration, so top-class researchers usually don't waste time on those journals. The problem is that even the top journals cannot be open only to groundbreaking papers. There are not that many groundbreaking papers, by definition. We mostly just go up one step at a time, which is already ultra painful.

Going back to my graduate studies, I tried to build a model in which the high speed of information flow among financial investors leads them to follow each other and copy the winning model, which results in financial markets overshooting (both hype and crash) at an accelerated speed. The process of information sharing that results in a suboptimal market equilibrium is called the 'Hirshleifer effect'. Modeling that idea into an equation that fits a variety of cases is a demanding task. Every researcher has their own take, because they need to solve different problems and have different backgrounds. It is unlikely we will end up with one common form for the effect. This is how the scientific field works.

Hype that attracts ignorance

People outside research, people in marketing who raise the AI hype, and people who cannot understand research but can understand marketers' catchphrases are the ones who frustrate us. As mentioned earlier, I did try to persuade them that it is only hype and that reality is far from the catch lines. I gave up doing so many years ago.

Friends of mine who have not pursued grad school sometimes claim that they just need to test the AI model. For example, if an AI engineer claims that his or her AI can beat Wall Street's top-class fund managers by a double or triple margin, my friends think all they need to do as venture capitalists is test it for a certain period of time.

The AI engineer may not be smart enough at first to hide failed results, but a series of failed funding attempts will make him smarter. From a certain point on, I am sure, the AI engineer begins showing off only the successful test cases from a limited time span. My VC friends will likely be fooled, because there is no such algorithm that can beat the market consistently. If I had that model, I would not go for VC funding. I would set up a hedge fund, or just trade with my own money. If I knew I could win with 100% probability and zero risk, why share the profit with somebody else?

The hype disappears not after a few failed tests, but when the marketing budget runs dry

Since many ignorant VCs are fooled, the hype continues. Once the funding is secured, the AI engineer spends more on marketing tools to show off, so that potential investors are brainwashed by the artificial success story.

As the tests fail multiple times, the actual investments made with fund buyers' money also fail. Clients begin complaining, but the hype is still high and the VC's funding is not dry yet. On top of that, the VC is now desperate to raise the invested AI start-up's value, so he or she lies as well. The VC may be uninformed of the failed tests, but it is unlikely that he or she never hears complaints from angry clients. The VC's lies, however unintentional, support the hype. The hype goes on. Until when?

The hype becomes invisible when people stop talking about it. And when do people stop talking about it? When the product is not new anymore? Well, maybe. But for AI products, if there are no real use cases, people finally understand that it was all marketing hype. Fewer clients, less word of mouth. To pump up the dying hype, the company may put even more budget into marketing. It does so until it completely runs out of cash. At some point there are no ads, so people just move on to something else. Finally, the hype is gone.

Then, AI hype followers no longer send me emails with disgusting and silly criticism.

Following AI hype vs. Studying AI/Data Science

On the contrary, there are some people determined to study this subject in depth. They soon realize that copying a few lines of program code from Github.com does not make them experts. They may read a few 'tech blogs' and textbooks, but the smarter they are, the faster they grasp that it requires loads of mathematics, statistics, and a lot more scientific background than they studied in college.

They begin looking for education programs. Over the last 7~8 years, a growing number of universities have created AI/Data Science programs. At the very beginning, many programs focused too much on computer programming, but driven by competition from coding boot-camps and by accreditation institutions, most AI/Data Science programs at US top research schools (or schools of similar level around the world) now offer mathematically heavy courses.

Unfortunately, many students fail, because the math and stats required of professional data scientists amount to more than copying a few lines of program code from Github.com. My institution, for example, runs an AI MBA with bachelor-level courses and an MSc in AI/Data Science for more qualified students. Most students know the MSc is superior to the AI MBA, but only a few can survive it. Some can't even understand the AI MBA courses, which are on par with undergrad material. Considering US top schools' failing rates in STEM majors, I don't think it is a surprise.

Those failing students are still better than AI hype followers, so they are highly unlikely to be fooled like my ignorant VC friends, but they are unfortunately not good enough to earn a demanding STEM degree. I am sorry to see them walk away from the school without a degree, but the school is not a diploma mill.

The distance from AI hype to professional data scientists

Students who graduate with a shining transcript and a quality dissertation find decent data scientist positions. That gives me a big smile. But then, on the job, sadly, most of their clients are mere AI hype followers. Whenever I attend alumni gatherings, I hear tons of complaints from students about their work environment.

It sounds like a Janus-faced situation to me. On one side, company officials hire data scientists because they follow the AI hype. They just don't know how to make AI products. They want to make the same or better AI products than their competitors. AI hype followers with money create this data scientist job market. On the other side, unfortunately, the employers are even worse than the failing students. They hear all kinds of AI hype, and they believe all of it. The orders given by such employers are likely to be far from realistic.

Had the employers had the same level of knowledge in data science as I do, would they have hired a team of data scientists for products that cannot be engineered? Had they known that there is no AI algorithm that can consistently beat financial markets, would they have invested in that AI engineer's financial start-up?

I admit that there are thousands of unsung heroes in this field who receive little consideration from the market simply because they have never jumped into hype marketing. The capacity of those teams must be the same as, or even better than, that of world-class top-notch researchers. But even with them, there are things that can be done and things that cannot be done by AI/Data Science.

Hype can only attract ignorance.


Why was 'Tensorflow' a revolution, and why are we so desperate to faster AI chips?

By Keith Lee, Professor @ SIAI

The transition from column to matrix, and from matrix to tensor, as the baseline of data feeding changed the scope of data science,
but faster only means 'better' when we apply the tool to the right place, with the right approach.

Back in the early 2000s, when I first learned Matlab to solve basic regression problems, I was told Matlab was the better programming tool because it processes data as matrices. Instead of feeding data to the computer column by column like other software packages, Matlab loads data in larger chunks at once, which accelerates processing from $O(n \times k)$ to $O(n)$. More precisely, given how the software feeds the RAM, it was essentially $O(k)$ to $O(1)$.

Together with a couple of other features, such as quick conversion of Matlab code to C code, Matlab earned huge popularity. A single copy was well over US$10,000, but companies with deep R&D and universities with significant STEM research facilities all jumped on Matlab. While it seemed there were no competitors, a free alternative was rising: R, whose packages handled data just like Matlab. R also created its own data handlers, which worked faster than Matlab for loop calculations. What I often call R-style (like Gangnam Style) replaced loop calculations, turning column-by-column feeding into a single matrix-type process, as in the sketch below.
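A minimal sketch of that point (ours, in Python/NumPy rather than Matlab or R, so the comparison runs anywhere):

```python
import time
import numpy as np

n, k = 2000, 500
X = np.random.rand(n, k)
w = np.random.rand(k)

# Column/loop-style feeding: one multiply-accumulate at a time.
start = time.perf_counter()
y_loop = np.zeros(n)
for i in range(n):
    for j in range(k):
        y_loop[i] += X[i, j] * w[j]
loop_time = time.perf_counter() - start

# Matrix-style feeding: the whole chunk in one vectorized call.
start = time.perf_counter()
y_mat = X @ w
mat_time = time.perf_counter() - start

assert np.allclose(y_loop, y_mat)
print(f"loop: {loop_time:.3f}s  matrix: {mat_time:.5f}s")  # orders of magnitude apart
```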

R (its commercial steward, RStudio, is now called Posit) became my main software tool for research, until I discovered its failure in handling imaginary numbers. I had trouble reconciling R's output with my hand-derived solution and Matlab's. Later, I ended up with Mathematica, but given the price tag attached to Mathematica, I still relied on R for communicating with research colleagues. Even after Python data packages became prevalent, up to Tensorflow and PyTorch, I did not really bother to code in Python. Tensorflow was (and is) also available in R, and there was not that much speed improvement in Python. If I wanted faster calculation for the multi-dimensional tasks that require Tensorflow, I coded the work in Matlab and converted it to C. There initially was a little bug, but Matlab's price tag was worth the money.

A few years back, I found Julia, which has a grammar similar to R and Python but C-like calculation speed, with support for numerous Python packages. Though I am not an expert in it, I feel more conversant with Julia than I do with Python.

When I tell this story, I get questions about why I traveled across so many software tools. Had my math models evolved so far that I required other tools? In fact, my math models are usually simple. At least to me. Then why move from Matlab to R, Mathematica, Python, and Julia?

Since my only programming experience before Matlab was Q-Basic, I did not really appreciate the speed enhancement of matrix-based calculations at first. But when I switched to R and dropped my for-loops, I almost cried. It felt like Santa's Christmas package held the game console I had dreamed of for years. I was able to solve numerous problems that I had not been able to before, and the way I code solutions changed as well.

The same kind of transition struck me when I first came across Tensorflow. I am not a computer scientist and do not touch image, text, or other low-noise data, so the introduction of Tensorflow by the computer guys failed to earn my initial attention. However, on my way back, I came to think of the transition from Matlab to R, and of similar challenges I had struggled with. There were a number of 3D data sets that I had had to re-array as matrices. There were infinitely many data sets in the shape of panel data and multi-sourced time series.

When searching for the right stats library to solve my math problems with simple functions, R was usually not my first choice. It was Mathematica, and it still is; but since the introduction of Tensorflow, I always think about how to leverage 3D data structures to minimize my coding work, as in the sketch below.
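A minimal sketch of what that buys (our NumPy illustration with hypothetical dimensions): a panel held as one entity x time x variable tensor, so a within-transformation is a single call instead of a loop over flattened matrices.

```python
import numpy as np

n_firms, n_periods, n_vars = 100, 24, 5
panel = np.random.rand(n_firms, n_periods, n_vars)    # the whole panel as one tensor

# Tensor-style: de-mean every variable within each firm in one vectorized call
# (the fixed-effects 'within' transformation), no loop over firms required.
within = panel - panel.mean(axis=1, keepdims=True)

# The old matrix workaround flattens the panel and loses the firm/time axes.
flat = panel.reshape(n_firms * n_periods, n_vars)
print(within.shape, flat.shape)                       # (100, 24, 5) (2400, 5)
```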

Once successful, it not only saves me coding time but tremendously cuts my 'waiting' time. During my PhD, one night, the night before a scheduled meeting with my advisor, I found a small but hugely important error in my calculation. I was able to re-derive the closed-form solutions, but I was absolutely sure my laptop would not give me a full-set simulation by the next morning. I cheated on the simulation and created a fake graph. My advisor was nice enough to pinpoint that something was wrong with my simulation within a few seconds. I confessed. I had been in too much of a hurry; I should have just skipped that week's meeting. I remember it took me years to earn back his confidence. With the faster machine tools available these days, I don't think I would ever need to fake a simulation. I just need my brain to process faster, more accurately, and more honestly.

After the introduction of the H100, many researchers in LLMs feel less burdened by massive data sets. As AI chips get faster, the size of data we can handle in a given amount of time grows exponentially. It will certainly eliminate cases like my untruthful communication with my advisor, but I always ask myself, "Where do I need hundreds of H100s?"

Though I do appreciate the benefits of faster processing, and I do admit that cheaper computational costs open opportunities that have not been explored, one still needs to answer 'where' and 'why' one needs it.


The process of turning web novels into webtoons and data science

By Keith Lee, Professor @ SIAI

Web novel to webtoon conversion is not based on 'profitability' alone
If the novel's author is endowed with money or bargaining power, 'webtoonization' may be nothing more than a marketing tool for the web novel
Data science models based on market variables cannot capture such cases

A student in SIAI's MBA AI/BigData program, struggling with her thesis, chose as her topic the conditions under which a web novel is turned into a webtoon. In general, people would simply think that if the number of views is high and the sales volume of the web novel is large, a follow-on contract with a webtoon studio will come much more easily. She brought in a few reference data science papers, but they only looked into publicly available information. What if the conversion was the web novel author's own choice? What if the author just wanted to spend more marketing budget by adding a webtoon to his line-up?

The literature mostly runs hierarchical 'deep learning' structures or 'SVM', tasks that simply rely on computer calculation, enumerating all the cases a Python library provides. Sorry to put it this way, but such calculations are nothing more than a waste of computer resources. It has also been pointed out that the crude reports of such researchers are still registered as academic papers.


Put all the crawled data into 'AI', and it will wave a magic wand?

Converting a web novel into a webtoon can be seen as changing a written storybook into an illustrated storybook. Professor Daeyoung Lee, Dean of the Graduate School of Arts at Chung-Ang University, explained that the change to OTT is likewise a change to a video storybook.

The reason this transition is not easy is that the transition costs are high. Korean webtoon studios employ teams of designers ranging from as few as 5 to as many as dozens, and the market has differentiated to the point where even a small character image or pattern that looks simple to our eyes must be purchased for use. After paying all the labor costs and the purchase costs for characters, patterns, and the like, it still takes serious money to turn a web novel into a webtoon.

It is probably the mindset of typical 'business experts' to assume that manpower and funds are concentrated on the web novels that seem most likely to succeed as webtoons, since investment money is at stake and new commercialization challenges must be taken on.

However, the market does not operate solely on the logic of capital, and 'plans' based on the logic of capital are often wrong because they fail to read the market properly. In other words, even if you build a model on data such as the views, comments, and purchases provided by platforms to estimate the likelihood of webtoonization and the webtoon's success, it is unlikely to actually be correct.

One thing to point out here is that although many errors stem from market uncertainty, a significant number also stem from model inaccuracy.

Wrong data, wrong model

For those who simply think 'deep learning' or 'artificial intelligence' will take care of it, building the model incorrectly only means using a less suitable algorithm when one of the 'deep learning' algorithms would have fit better, or, worse, using a 'less good' artificial intelligence when a 'good' one should have been used.

However, which 'deep learning' or 'artificial intelligence' fits well and which does not is a lower-priority question. What really matters is how accurately you capture the market structure hidden in the data, so you must be able to verify that the model fits not just the data selected today, by chance, but also the data selected in the future, consistently. Unfortunately, we have long seen that most 'artificial intelligence' papers published in Korea intentionally select and compare data from time points that happen to fit well; that professors' research capability is judged simply by the number of K-SCI papers; and that proper verification is not carried out, owing to the Ministry of Education's crude rules on which frequently cited journals count as good ones.

The calculation known as 'deep learning' is simply one of the graph models, one that finds nonlinear patterns in a more computation-dependent manner. For natural language, which must follow grammar, or computer games, which must be played by the rules, there may be no major problem, because the probability of error in the data itself is close to 0%. The webtoonization process above is different: it leaves problems the data cannot resolve, and the actual decision-making process for webtoons is likely quite different from what an outsider would see.

Simply put, the barriers faced by writers who already have a successful track record are completely different from those faced by new writers. Kang Full, a writer who recently achieved great success with 'Moving', explained in an interview that he held the intellectual property rights of the webtoon from the beginning and made the major decisions during the transition to OTT. This is a situation ordinary web novel and webtoon writers cannot even imagine, because most platforms sign contracts that retain the intellectual property rights for secondary works and then sell the content on the platform.

How often can an author decide, by his or her own will, whether a webtoon or an OTT adaptation gets made? If that proportion increases, what conclusion will the 'deep learning' model above produce?

The general public's mental model does not include cases where webtoon and OTT adaptations are carried out at the author's will. The 'artificial intelligence' models mentioned above can only explain the share of webtoonization decided by the 'logic of capital' operating inside the web novel and webtoon platforms. As soon as the share of 'author's will' grows at the expense of the 'logic of capital', the model will estimate the effects of the variables we expected as much lower, and, conversely, unexpected variables will appear more influential. In reality, we simply failed to include an important variable, 'author's will', that should have been in the model; and because we never even considered it, we end up with an absurd story under the absurd title of 'The webtoonization process, as revealed by artificial intelligence'.
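The distortion is easy to simulate. Below is a minimal sketch (simulated data and hypothetical variable names, ours): when the unobserved 'author's will' is dropped, the coefficient on the observable market variable absorbs its effect and looks far larger than it truly is.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
views = rng.normal(size=n)                          # observable market variable
author_will = 0.6 * views + rng.normal(size=n)      # unobserved, correlated with views
webtoonized = 0.3 * views + 0.7 * author_will + rng.normal(size=n)

# Full model vs the mis-specified model that omits author's will.
X_full = np.column_stack([views, author_will])
beta_full = np.linalg.lstsq(X_full, webtoonized, rcond=None)[0]
beta_short = np.linalg.lstsq(views[:, None], webtoonized, rcond=None)[0]

print(beta_full)    # ~[0.3, 0.7]: the true structure
print(beta_short)   # ~[0.72]: 'views' absorbs the omitted variable's effect
```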

Before data collection, understand the market first

It has now been two months since the student brought in that model. For the past two months, I have been asking her to understand the market situation properly, to find the missing pieces in the webtoonization process.

From my business experience, I have seen companies that thought they could take on an interesting challenge with enough data but could not proceed for lack of the 'Chairman's will'. On the other hand, I have seen countless companies that were completely unprepared, without even the necessary manpower, push absurd project ideas 'because that is what the Chairman said', proceeding 'as usual': only IT developers are hired, no data science experts, and the work of copying open-source libraries from overseas markets is repeated.

Considering the capital and the market conditions required for webtoonization, it is highly likely that a significant number of webtoon deals are included in web novel writers' new-work contracts as a 'bundle', offered naturally to attract already successful writers and to generate profits. Writers who want to control the webtoon studio themselves are likely to contract with a studio directly and start serializing the webtoon after the first 100 or 300 episodes of the web novel are released. A web novel writer who has already seen profits rise from the extra promotion a webtoon brings may well regard the webtoon product as one more promotional strategy for selling the intellectual property (IP) at a higher price.

To the general public, this 'author's will' may look like an exception, but once even 30% of web novels converted to webtoons follow that route, it becomes impossible to explain webtoonization with data collected under the conventional mindset. With various market factors already making accuracy hard to achieve, and with more than 30% of cases driven by variables like 'author's will' rather than 'market logic', how could data collected under the conventional mindset lead to any meaningful explanation?

Data science is not about learning ‘deep learning’ but about building an appropriate model

In the end, it comes back to the point I always press on students: you must understand reality and find a model that fits that reality. In plain English, you need to find a model that fits the Data Generating Process (DGP), and the webtoonization model above does not take the DGP into consideration at all. If scholars had to sit through such a presentation, complaints like 'Who on earth selected the presenters?' would arise, and many would simply walk out even at the risk of looking rude, because such a presentation is itself disrespectful to the attendees.

To build a model that does take the DGP into consideration here, you need a great deal of background knowledge about the web novel and webtoon markets. A model that ignores how web novel writers on major platforms communicate with platform managers, what the market relationship between writers and platforms looks like, and to what extent and how the government intervenes, and instead simply ingests materials scraped from the Internet, is pointless: there is no value in merely 'putting data into' the models that appear in 'artificial intelligence' textbooks. If an understanding of the market could be derived from that data, it would be attractive data work; but, as I keep saying, unless the data is natural language that follows grammar, or a game that follows rules, it will be nothing but a meaningless waste of computer resources.

I don't know whether the student will bring market research that destroys my counterargument at next month's meeting, or change the detailed structure of the model based on her understanding of the market, or, worse, change her topic. What is certain is that a 'paper' that treats 'data' as a simple matter of feeding whatever was collected into a coding library will end up being nothing more than mixed-up code wrapped around one's own delusions, a 'novel' filled with nothing but text.


Can a graduate degree program in artificial intelligence actually help increase wages?

By Keith Lee, Professor @ SIAI

Asian companies convert degrees into years of work experience
Without adding extra value to the AI degree, it does not help much with salary
'Dummification' of variables is required to avoid wrong conclusions

In every new group, I hide the fact that I have studied up to the PhD level, but there always comes a moment when I have no choice but to make a professional remark. When I end up revealing that my 'bag strap' is a little longer than others' (a Korean idiom for extra years of schooling), I always get asked questions. People sense through a brief conversation that I am an educated guy, but the question is whether the market actually values it more highly.

Asked that same question, my impression is that in Asia degrees are usually sold only on 'name value', while in the western hemisphere candidates go through a very thorough evaluation of whether they have actually studied more, know more, and are therefore more capable in corporate work.


Typical Asian companies

I have met many Asian companies, but I have hardly seen any with a reasonable internal standard for validating ability, other than counting years of schooling as years of work experience. Given that some degrees take far more effort and skill than others, you can see how this rigid Asian style ends up misrepresenting true ability.

For degree-based education to actually help increase wages, a decent evaluation model is required. Let's assume we are creating a data-based model to determine whether an AI degree actually helps increase wages. Picture a young company that has grown a bit and is now actively trying to recruit highly educated talent. It has a vague sense that such hires' salaries should be set at a different level from those of the people it has hired so far, but it has only very superficial figures on what it should actually pay.

Asian companies usually end up looking only for comparative information, such as how much salary large corporations in the same industry are paying. Rather than judging specifically what was studied during the degree program and how helpful it is to the company, the 'salary' is determined by a simple split into PhD, Master's, and Bachelor's. Since most Asian universities have lower standards in grad school, companies also separate graduate degrees into US/Europe and Asia. They create a salary table for each group, place employees into the table, and that is how they set salaries.

The annual salary structures of the large Asian companies I have seen count a master's as 2 years and a doctorate as 5 years, and apply the salary table as if those were years worked at the company. For example, if a student who entered an integrated master's-doctoral program at Harvard University right after graduating from an Asian university, and who graduated after 6 years of hard work, gets a job at an Asian company, the HR team applies 5 years for the doctorate. The salary range is set at the level of an employee with 5 years of experience. Of course, a graduate of a prestigious university may expect a higher salary through various bonuses, but as the 'salary table' structure of Asian companies has remained unchanged for decades, it is hard for them to differentiate a PhD holder from a prestigious university from an employee with 6 years of experience.

I get a lot of absurd questions about whether one could simply gather 100 people with bachelor's, master's, and doctoral degrees, find out their salaries, and run an 'artificial intelligence' analysis. If the situation above holds, then no matter what calculation method is used, be it a computationally heavy recent method or simple linear regression, as long as salary is set by converting degrees into years, the analysis will never conclude that a degree program helps. Some PhD programs require over 6 years of study, yet your salary at an Asian company will match that of an employee with 5 years of post-bachelor's experience.

Harmful effects of a simple salary calculation method

Now imagine a very smart person who knows this situation. A talented person with exceptional capabilities is unlikely to settle for the salary determined by the table, so he or she may simply lose interest in the large company. Companies looking for talent in major technology industries such as artificial intelligence and semiconductors therefore face deeper concerns about salary: they risk a hiring failure, recruiting people who are not skilled but merely hold a degree.

In fact, the research labs run by some passionate professors at Seoul National University operate in the western style: students must write a decent dissertation to graduate, regardless of how many years it takes. This draws a lot of criticism from students who want jobs at Korean companies; you can find such criticism of these professors on websites like Dr. Kim's Net, which compiles evaluations of domestic researchers. The simple year-conversion is preventing the growth of proper researchers.

In the end, because of a salary structure created for the convenience of Asian companies that lack the capacity for complex decisions, the people they hire are mainly those who finished a degree program in the standard 2 or 5 years, with the quality of the thesis ignored.

A salary model where pay is based on competency

Let's step away from the frustrating Asian cases and assume degrees are earned by competency. Let's build a data analysis in accordance with the western standard, where the degree can serve as a credible indicator of competency.

First, you can use a dummy variable indicating whether or not one holds a degree as an explanatory variable. Next, the salary growth rate becomes another important variable, because growth rates may differ with the degree. Lastly, to capture the interplay between the degree dummy and the salary growth rate, add a variable that multiplies the two. This last variable lets us distinguish salary growth without a degree from salary growth with a degree. If you want to distinguish master's from doctoral degrees, set up two dummy variables and add each of them multiplied by the salary growth rate.

What if you want to distinguish those who hold an AI-related degree from those who do not? Just add a dummy variable indicating an AI-related degree, together with its product with the salary growth rate, in the same manner as above. Of course, this need not be limited to AI; the same trick applies to many other distinctions.

One question that arises here is that each school has a different reputation and its graduates' actual abilities probably differ, so is there a way to distinguish them? Just as with the AI-degree condition above, add one more dummy variable. For example, you can create dummies for graduating from a top-5 university or for publishing the thesis in a high-quality journal. A sketch of the resulting design matrix follows.
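Here is a minimal sketch of that design matrix (hypothetical column names and toy numbers, ours):

```python
import pandas as pd

df = pd.DataFrame({
    "salary_growth": [0.03, 0.05, 0.04, 0.07],
    "msc":           [0, 1, 0, 1],   # master's degree dummy
    "phd":           [0, 0, 1, 0],   # doctoral degree dummy
    "ai_degree":     [0, 1, 1, 0],   # AI-related degree dummy
    "top5_school":   [0, 0, 1, 0],   # reputation dummy
})

# Interaction terms: do degree holders enjoy a *different* salary growth rate?
df["msc_x_growth"] = df["msc"] * df["salary_growth"]
df["phd_x_growth"] = df["phd"] * df["salary_growth"]
df["ai_x_growth"] = df["ai_degree"] * df["salary_growth"]
print(df)   # regress salary on these columns to separate each effect
```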

If you use an 'artificial intelligence calculation method', is there no need to create dummy variables?

The biggest reason the overseas-standard salary model above is hard to apply in Asia is that the research methodologies taught in advanced degree courses are extremely rarely applied in practice, and it is just as rare for their value to translate into company profits.

In the example above, when the analysis is run by simply designating a categorical variable without creating dummy variables, the computer code internally transforms the categories into dummy variables anyway; in machine learning this task is called 'one-hot encoding'. However, when 'Bachelor's - Master's - Doctorate' is coded as '1-2-3' or '0-1-2', the weight given to a doctorate in the salary calculation is forced to 1.5 times that of a master's (the 2:3 ratio) or 2 times it (the 1:2 ratio), and an error follows. The master's and doctoral degrees must instead be separated into independent variables so that each degree's salary effect can be estimated on its own. With the wrong weights, the '0-1-2' coding may lead to the conclusion that the salary effect of a doctorate is about half that of a master's, and the '1-2-3' coding likewise misjudges the doctorate's salary effect at 50% or 67% below its actual size.
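A minimal simulated sketch of that pitfall (ours): ordinal coding forces the doctorate effect to be exactly double the master's effect, while separate dummies recover each effect freely.

```python
import numpy as np

rng = np.random.default_rng(3)
degree = rng.integers(0, 3, size=3000)        # 0=Bachelor's, 1=Master's, 2=PhD
# True effects: +10 for a master's, +12 for a PhD (deliberately NOT a 1:2 ratio).
salary = 50 + 10 * (degree == 1) + 12 * (degree == 2) + rng.normal(0, 1, size=3000)

# Ordinal coding: salary ~ a + b*degree forces the effects to be 0, b, 2b.
X_ord = np.column_stack([np.ones(len(degree)), degree])
print(np.linalg.lstsq(X_ord, salary, rcond=None)[0])   # b lands near 6: master's badly underestimated

# Dummy (one-hot) coding: separate, unconstrained effects per degree.
X_dum = np.column_stack([np.ones(len(degree)), degree == 1, degree == 2])
print(np.linalg.lstsq(X_dum, salary, rcond=None)[0])   # ~[50, 10, 12]
```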

Since 'artificial intelligence calculation methods' are essentially computations that run statistical regression analysis in a non-linear fashion, they can very rarely skip the data preprocessing that regression analysis needs to separate each variable's effect. The widely used data libraries of popular languages such as Python do not take all of these cases into account, and depending on the data at hand they hand out conclusions at a non-major's level.

Even without pointing to specific media articles or the papers they cite, you have probably seen claims that a degree program does not significantly help increase salary. After reading such papers, I always go through the process of checking for basic errors like the ones above. Unfortunately, it is not easy to find papers in Asia that pay such meticulous attention to variable selection and transformation.

Drawing wrong conclusions from a poor understanding of variable selection, separation, and cleaning is not unique to Korean engineering graduates. I once heard that, while recruiting developers, Amazon used the string length (in bytes) of the code candidates posted on Github, one of the platforms where developers often share code, as one of its variables. Rather than a good variable for judging competency, I think it can at best be read as a measure of how much care was taken to present the code well.

Many engineering students claim they simply copied and pasted code from similar cases found through Google searches and 'analyzed the data'. In the IT industry there may indeed be cases where development done that way causes no major problems. But in areas like the one above, where data transformation tailored to the research topic is essential, statistical knowledge at least at the undergraduate level is indispensable. Let us avoid the cases where advanced data is collected and incorrect analysis leads to incorrect conclusions.


Did Hongdae's hip culture attract young people? Or did young people create 'Hongdae style'?

By Keith Lee, Professor @ SIAI

The relationship between a commercial district and the concentration of consumers of a specific generation is mostly not a causal effect
Simultaneity often requires instrumental variables
Real cases, too, end up mis-specified due to endogeneity

When working on data science projects, causality errors are a common issue. There are quite a few cases where the variable thought to be the cause was actually the result, and conversely, the variable thought to be the result was the cause. In data science, this error is called 'simultaneity'. The first place where related research began was econometrics, where simultaneity is counted among the three major endogeneity errors, together with omitted variables and measurement error.

As a real-life example, let me bring in a SIAI MBA student's thesis. Judging that the commercial district in front of Hongik University in Korea attracts young people in their 20s and 30s, the student hypothesized that by finding the main variables that attract young people, one could identify the variables that make up a commercial district where young people gather. If her assumptions are reasonable, those who analyze commercial districts in the future could easily borrow the model, and commercial district analysis could serve not only people opening small stores but also areas such as consumer goods companies' promotional marketing and credit card companies' street marketing.

Hongdae station in Seoul, Korea

Simultaneity error

Unfortunately, however, it may not be the commercial area in front of Hongdae that attracts young people in their 20s and 30s, but the cluster of schools such as Hongik University and nearby Yonsei University, Ewha Womans University, and Sogang University. In addition, the subway station is one of the transportation hubs of Seoul. The commercial area in front of Hongdae, which was thought to be the cause, may actually be the result, and the young people who were thought to be the result may be the cause. In such cases of simultaneity, regression analysis or the various non-linear regression models that have recently gained popularity (e.g. deep learning, tree models, etc.) are likely to either exaggerate or understate the explanatory variables' influence.

Econometrics long ago introduced the concept of the 'instrumental variable' to handle such cases. It can be seen as a data preprocessing task that removes the problematic parts, including the parts where causal relationships run both ways, in any of the three major endogeneity situations. Data science, being a recently created field, has borrowed various methodologies from surrounding disciplines, but since this one originates in economics it remains unfamiliar to engineering majors.

In particular, people whose way of thinking was shaped by methodologies demanding perfect accuracy, such as mathematics and statistics, often dismiss instrumental variables as 'fake variables'. But the data of our reality carries all sorts of errors and correlations, so instrumental variables are an unavoidable device in research that uses real data.

From data preprocessing to instrumental variables

Returning to the commercial district in front of Hongik University, I asked the student: "Among the complex causal relationships between the two, can you find a variable that is directly related to the endogenous variable (relevance condition) but has no significant relationship with the other variable (orthogonality condition)?" One can look for variables that affect the growth of the Hongdae commercial district but have no direct effect on the gathering of young people, or variables that directly affect the gathering of young people but are not directly related to the Hongdae commercial district.

First of all, the existence of nearby universities plays a decisive role in attracting people in their 20s and 30s. The most direct way to check whether these universities help the young population gather without being directly tied to the Hongdae commercial district would be to remove each school one by one and watch the youth density. Unfortunately, schools cannot be removed individually. A more reasonable choice of instrumental variable is to consider how the Hongdae commercial district functioned during the COVID-19 period, when the number of students visiting the school area plummeted because classes went remote.

It is also a good idea to compare the areas in front of Hongik University and Sinchon Station (one station to the east, another symbol of hipster culture) to distinguish the characteristics of the stores that make up each commercial district, despite commonalities such as being transportation hubs with large student crowds. Since the general perception is that the area in front of Hongdae is full of unique stores that cannot be found anywhere else, the number of unique stores can also be used as a variable to untangle the complex causal relationship.

How does the actual calculation work?

The most frustrating habit I have seen from engineers is inserting every variable and all the data in the blind faith that 'artificial intelligence' will automatically find the answer. Among such habits there is a method called 'stepwise regression', which repeatedly inserts and removes various variables. Despite warnings from the statistical community that it should be used with caution, many engineers without proper statistical training reach for it anyway; too often I have seen it applied haphazardly and without thinking.

As pointed out above, when a linear or non-linear regression is computed without first eliminating the 'error of simultaneity' hiding a complex causal relationship, the effects of variables are bound to be overstated or understated. In such cases, data preprocessing must come first.

Data preprocessing using instrumental variables is called '2-Stage Least Squares (2SLS)' in the data science field. In the first stage, the complex causal relationship is removed and reduced to a simple one; in the second stage, the ordinary linear or non-linear regression analysis we know is performed.

In the first, removal stage, the variable intended as an explanatory variable is regressed on one or more of the instrumental variables selected above. Returning to the Hongdae example, the youth population is the explanatory variable we want to use, and the university-related variables, which are likely related to young people but not expected to be directly related to the Hongdae commercial district, serve as instruments. If you run this regression with the period before and after the COVID-19 pandemic coded as 0 and 1, you can extract only the part of the youth population that is explained by the universities. Using the variable extracted this way, the relationship between the Hongdae commercial district and young people can be identified through a simple causal relationship rather than the complex one above.
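
For readers who want to see the mechanics, here is a minimal sketch of the two-stage procedure, assuming a hypothetical dataset with columns `commercial_growth`, `youth_density`, and a 0/1 `covid_period` instrument (the column names and file are illustrative, not the student's actual data):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical columns: 'commercial_growth' (outcome), 'youth_density'
# (endogenous explanatory variable), 'covid_period' (0/1 instrument).
df = pd.read_csv("hongdae_district.csv")  # placeholder file name

# Stage 1: regress the endogenous variable on the instrument and keep
# only the variation in youth density that the instrument explains.
stage1 = sm.OLS(df["youth_density"],
                sm.add_constant(df["covid_period"])).fit()
df["youth_hat"] = stage1.fittedvalues

# Stage 2: replace the raw endogenous variable with its fitted values.
stage2 = sm.OLS(df["commercial_growth"],
                sm.add_constant(df["youth_hat"])).fit()
print(stage2.params)
```

Note that running the two stages by hand like this gives consistent point estimates but understates the second-stage standard errors; a packaged routine such as `IV2SLS` in the `linearmodels` library corrects them automatically.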

Failure cases of actual companies in the field

Since I do not have the actual data, it is hard to give a definitive opinion, but judging from the cases of 'error of simultaneity' I have encountered so far, if all the data had simply been inserted and a linear or non-linear regression computed without the 2SLS step, great weight would have landed on the simple conclusion that the Hongdae commercial district expanded because there are many young people, while the other variables, such as monthly rents in nearby residential and commercial areas, the presence of unique stores, and accessibility to subway and bus stops, would have come out largely insignificant. The complex interaction between the two would have absorbed the explanatory power that should have been assigned to the other variables.

Many engineering students who have not received proper training in Korea rely on tree models and deep learning in the same spirit as stepwise analysis, inserting multiple variables and their interactions, and then claim the output is a 'conclusion found by artificial intelligence'. But the explanatory structure between the variables remains; the only difference is whether it is linear or non-linear, which partially modifies each variable's explanatory power while the flawed conclusion stays the same.

The above case matches, almost perfectly, the mistake made when a credit card company and a telecommunications company jointly analyzed the commercial districts of the Mapo-gu area. An official who participated in the study used the expression 'gathering young people is the answer', but, as expected, showed no understanding of why instrumental variables were needed; he regarded data preprocessing as nothing more than discarding missing records.

In fact, the elements that make up not only Hongdae but all the major commercial districts of Seoul are very complex. Young people gather mostly because the complex components of a district have combined into something attractive, and it is hard to find that answer with simple 'artificial intelligence calculations' like the above. In pointing out the errors in the data analysis currently done in the market, I singled out the 'error of simultaneity', but the work also involves errors from missing important variables (omitted variable bias) and from inaccurately measured variables (attenuation bias from measurement error). It calls for quite advanced modeling that weighs all such factors together.

We hope that students who are receiving incorrect machine learning, deep learning, and artificial intelligence education will learn the above concepts and be able to do rational and systematic modeling.

SNS heavy users have lower income?

One-variable analysis can lead to big errors, so you must always understand the complex relationships among multiple variables.
Data science is model research that finds the complex relationships among multiple variables.
Obsessing over one variable is a thing of the past; you need to upgrade your way of thinking for the era of big data.

Whether I am giving a data science talk, correcting employees who come in with wrong conclusions, or delivering an external lecture, the point I always emphasize is: do not do 'one-variable regression'.

To give the simplest examples, these range from conclusions with an inverted causal relationship, such as 'whenever I buy a stock, it falls', to hasty single-cause conclusions, such as 'women are paid less than men' or 'immigrants are paid less than native citizens'. The problem is not solved simply by using a calculation method known as 'artificial intelligence'; you need a rational thinking structure that can distinguish cause from effect to avoid falling into such errors.

Do SNS heavy users end up with lower wages?

Among the most recent examples I have seen, the common belief that heavy social media use causes one's salary to fall continues to bother me. If anything, someone who uses SNS well can save on promotional costs, so the salaries of professional SNS marketers are likely to be higher. I cannot understand why a story that only applies to high school seniors cramming for exams is being applied to the salaries of ordinary office workers.

Salary is influenced by many factors: one's own capabilities, the degree to which the company utilizes those capabilities, the added value produced through them, and the pay levels of similar occupations. If you ignore those numerous variables and run a 'one-variable regression analysis', you arrive at the hasty conclusion that you should quit social media if you want a high-paying job.

At this point some may think, 'So analyzing with artificial intelligence only leads to wrong conclusions?'

Is it really so? Below is a structured analysis of this illusion.

Source=Swiss Institute of Artificial Intelligence

Problems with one-variable analysis

A total of five regression analyses were run, each adding one or two more of the variables listed on the left. The first variable is whether one uses SNS; the second, whether one is a woman who uses SNS; the third, whether one is a woman; the fourth, age; the fifth, the square of age; and the sixth, the number of friends on SNS.

The first regression, reported as (1), is a representative example of the one-variable regression mentioned above; its conclusion is that using SNS raises salary by 1%. Someone who saw that conclusion and recognized the one-variable problem asked whether women, who use SNS relatively more, are being paid less. In (2) we therefore distinguished women who use SNS from everyone else who uses SNS: the salary of non-female SNS users was 11.8% higher, while the wages of women who use SNS were 18.2% lower.

Those of you who have read this far may be thinking, 'As expected, discrimination against women really is this severe in Korean society.' Others may want to separate whether the salary fell simply because the person is a woman, or because she uses SNS.

That calculation was performed in (3). Those who were not women but used SNS earned 13.8% more; women who used SNS earned only 1.5% more; and being a woman was associated with a 13.5% lower salary. The conclusion: being a woman who uses SNS carries little meaning on its own, while being paid less for being a woman is a very significant variable.

At this point a question may arise as to whether age is an important variable; when age was added in (4), it turned out not to be significant. The reason I also included the square of age is that people around me who wanted to study 'artificial intelligence' asked whether using an 'artificial intelligence' calculation would change the result. Variables such as SNS use and male/female are simple 0/1 data, so the result cannot change regardless of the model used; age, however, is not 0/1, so the squared term was added to check whether there is a non-linear relationship between the explanatory variable and the outcome, given that 'artificial intelligence' calculations are precisely calculations that squeeze out non-linear relationships wherever possible.

Even if we add the non-linear variable called the square of age above, it does not come out as a significant variable. In other words, age does not have a direct effect on salary either linearly or non-linearly.

Finally, when the number of SNS friends was added in (5), the conclusion was that only having a large number of online friends lowers salary, by 5%, and that simply using SNS does not affect salary at all.

Through the step-by-step calculation above, we can confirm that using SNS does not reduce salary; rather, it is using SNS very heavily and pouring attention into online friendships that drags salary down, and even that accounts for only 5% of the total. The bigger problem, in fact, is the other aspect of the employment relationship expressed by gender.
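
The staged comparison is easy to reproduce. Here is a minimal sketch with the `statsmodels` formula API, assuming a hypothetical survey extract with columns `log_wage`, `sns`, `female`, `age`, and `n_sns_friends` (the names, like the file, are illustrative):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wage_survey.csv")  # hypothetical survey extract

# (1) the naive one-variable regression: "does SNS use move wages?"
m1 = smf.ols("log_wage ~ sns", data=df).fit()

# (3) separate 'being female' from 'being female and using SNS'.
m3 = smf.ols("log_wage ~ sns + female + female:sns", data=df).fit()

# (4)-(5) add age, age squared (a non-linear term), and friend count.
m5 = smf.ols("log_wage ~ sns + female + female:sns + age + I(age**2)"
             " + n_sns_friends", data=df).fit()

# Watch how the SNS coefficient shrinks as controls are added.
for m in (m1, m3, m5):
    print(m.params["sns"], m.pvalues["sns"])
```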

Numerous one-variable analyses encountered in everyday life

When I meet a friend at an investment bank, I sometimes hear the expression, 'The U.S. Federal Reserve raised interest rates, so stock prices plummeted'; from a friend in the VC industry, 'The VC industry is struggling these days because fund-of-funds money has shrunk.'

On one hand this is true: the central bank's rate hikes and the reduced supply of policy funds do significantly affect stock prices and market contraction. On the other hand, the conversation never establishes how large the impact was, or whether the policy variable alone mattered while no other variable had any effect. It may not matter when it stays within chatter between friends, but if one-variable analysis is used the same way among those who make policy decisions, it is no longer a small problem: assuming a simple causal relationship and prescribing a solution, in a situation where numerous other factors must be weighed, is bound to produce unexpected problems.

U.S. President Truman famously quipped that he wanted a 'one-handed economist', because the economists hired as his advisors always offered an interpretation of an event with one hand ('on the one hand...') while producing a different interpretation and the accompanying policy with the other ('on the other hand...').

From a data science perspective, President Truman was demanding a one-variable analysis while his economists insisted on providing at least a two-variable analysis. And this does not happen only to President Truman: conversations with countless non-expert decision makers involve the same tension, where they request a one-variable answer and you agonize over how to deliver the second variable more digestibly. Each time I experience this, I wish the decision maker were able to weigh multiple variables, and I catch myself thinking that if I were the decision maker, I would know enough to make more rational choices.

Risks of one-variable analysis

It was about two years ago. A new person in charge at an outsourcing client asked me to explain a previously delivered model one more time. The existing model was a graph model based on network theory, showing how the words connected to a given keyword related to each other and how they were intertwined. It is the kind of model that helps read public opinion through keyword analysis and lets a company or organization devise an appropriate marketing strategy.

The new person in charge listened to the explanation with visible displeasure and demanded to be told, in a single number, whether the public's evaluation of their main keyword was good or bad. I suggested an alternative: few words cleanly capture such likes and dislikes, but there were plenty of related words with which he could gauge the phenomenon, plus information identifying the relationships between those words and the key keyword, so he should make use of them.

He insisted to the end on his single number, so I explained that if we threw away all the related words and simply matched swear words and praise words from a dictionary, we would be using less than 5% of the total data, and that judging likes and dislikes from that sliver is a very crude calculation.

By that point I already sensed that this person was looking for a one-handed economist and had no interest in data-based understanding, so I was eager to end the meeting and wrap up the situation. I was quite shocked to hear from a colleague that he had previously been in charge of data analysis at a very prominent organization.

Perhaps what he had done for 10 years was deliver to his superiors a one-variable metric that compresses information into a simple 'positive/negative'. Perhaps he understood that the positive/negative distinction based on dictionary words was a crude analysis, yet he was deeply frustrated when I refused to serve up the same conclusion. In the end I made him a simple pie chart of positive and negative dictionary words, but the fact that people who analyze a single variable this way have spent 10 years working as data experts at major organizations seems to show the reality of the 'AI industry'. It was a painful experience. The world has changed a great deal in those 10 years, and I hope he can adapt to the changing times.

High accuracy with 'Yes/No' isn't always the best model

With high variance, 0/1 matching hardly yields a decent model, let alone one that holds on a new set of data
What is known as 'interpretable' AI is no more than basic statistics
'AI' = 'Advanced' = 'Perfect' is a misperception, if not a myth

Five years ago, a simple 'artificial intelligence' tutorial that used data on residential areas around Boston to predict house prices or monthly rents from information such as room size and number of rooms had just spread through social media. Not long after, an institution claiming to study AI seriously, with members from all kinds of data engineering and data analysis backgrounds, asked me to give a talk on online targeted advertising models in data science.

I was shocked for a moment to learn that such a low-level presentation meeting was being sponsored by a large, well-known company. I had seen an SNS post saying that the data was fed into various 'artificial intelligence' models and that the best-fitting one was a 'deep learning' model; the poster showed it off and boasted that the group was full of highly skilled people.

Back then as now, exercises such as putting the models introduced in textbooks through the various calculation libraries Python provides and seeing which one fits best are treated as simple code-running practice rather than research. I was shocked then, but I have since seen similar papers not only from engineering researchers but also from medical researchers, and even from researchers in mass communication and sociology. It is one of the things that shows how shockingly most degree programs in data science are run.

Just because it fits ‘yes/no’ data well doesn’t necessarily mean it’s a good model

For the task of matching dichotomous outcomes classified as 'yes/no' or '0/1', what must be carried out is not a check of the model's accuracy on the given data but robustness verification: determining whether the model can repeatedly fit well on similar data.

In the field of machine learning, such robustness verification is performed by splitting off 'test data' from 'training data'. This is not a wrong method, but it has the limitation of applying only when the similarity of the data keeps repeating.

To make this easier to understand, stock price data is the textbook example of data that loses similarity. Take the past year of prices, use months 1 to 6 as training data, and apply the best-fitting model you find to months 7 to 12: it is very difficult to obtain the same level of accuracy in the following year, or even on other past data. Professional researchers have a running joke about such meaningless calculations: accuracy that looks impressive in-sample should be expected to collapse toward zero out-of-sample. For data whose similarity does not keep repeating, this shows how meaningless it is to hunt for the model that fits '0/1' best.

The information commonly used as an indicator of data similarity is periodicity, which appears in the analysis of frequency data; expressed in high-school mathematics, think of functions such as sine and cosine. Unless the data repeats itself periodically in a similar way, being good at distinguishing '0/1' on the verification data gives no reason to expect the model to do well on new, external data.
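
A small synthetic sketch makes the limitation concrete. In the first case the pattern persists, so a random split reports honest accuracy; in the second, the relationship flips halfway through, mimicking a market whose behaviour does not repeat, and a chronological split exposes the collapse (all data here is simulated purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# 'Low-noise' data: the same pattern holds in train and test,
# so a random split reports honest accuracy.
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("i.i.d. split:", accuracy_score(y_te, clf.predict(X_te)))

# 'High-noise' regime: the relationship flips in the later half.
# Training on the first half and testing on the second half shows
# the collapse that a random split would have hidden.
y2 = np.concatenate([y[:500], 1 - y[500:]])
clf2 = LogisticRegression().fit(X[:500], y2[:500])
print("chronological split:", accuracy_score(y2[500:], clf2.predict(X[500:])))
```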

Such low-repeatability data is called 'high-noise data' in data science, and instead of paying enormous computational costs for models such as deep learning, known as 'artificial intelligence', a general linear regression model is used to explain the relationships in the data. In particular, if the distribution of the data is one well known to researchers, such as the normal, Poisson, or beta distribution, a linear regression or a similar formula-based model achieves high accuracy without the computational cost. This has been common sense in the statistical community since the 1930s, when the concept of regression analysis was established.
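
As a small illustration of the formula-based route (simulated data, not a real study): when the outcome is known to follow a Poisson distribution, a generalized linear model recovers the structure almost instantly, with standard errors included for free:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Count data with a known Poisson structure: true model is
# E[y | x] = exp(0.5 + 0.8 * x).
x = rng.uniform(0, 2, size=500)
y = rng.poisson(np.exp(0.5 + 0.8 * x))

# A formula-based GLM fits this in milliseconds, no GPU required.
model = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
print(model.params)  # close to the true (0.5, 0.8)
print(model.bse)     # standard errors come straight from the likelihood
```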

Be aware of different appropriate calculation methods for high- and low-variance data

The reason many engineering researchers in Korea do not know this, and mistakenly believe that an 'advanced' calculation method called 'deep learning' will yield better conclusions, is that the data used in engineering is mostly frequency-type 'low-variance data'; during their degree programs they never learn how to handle high-variance data.

Moreover, since machine learning models are specialized for identifying the non-linear structures that appear repeatedly in low-variance data, the question of generalizing beyond '0/1' accuracy gets pushed aside. For example, among the calculation methods that appear in machine learning textbooks, none except 'logistic regression' can use the distribution-based verification methods the statistical community relies on for model checking, because the variance of the model cannot be computed in the first place. Academic circles express this by saying that 'first moment' models cannot undergo 'second moment'-based verification; variance and covariance are the commonly known 'second moments'.

Another big problem with such 'first moment'-based calculations is that no reasonable account can be given of the relationships between the variables.

$$\widehat{UGPA}_i = \underset{(0.33)}{1.39} + \underset{(0.094)}{0.412}\, HGPA_i + \underset{(0.011)}{0.15}\, SAT_i - \underset{(0.026)}{0.083}\, SK_i$$

Let's take an example.

The equation above is a simple regression built to determine how much college GPA (UGPA) is influenced by high school GPA (HGPA), CSAT scores (SAT), and attendance (SK); the figures in parentheses are standard errors. Setting aside any problems among the variables and assuming the equation was computed reasonably, we can see that high school GPA carries as much as 41.2% of the weight in determining undergraduate GPA, while CSAT scores carry only 15%.

Machine learning calculations based on the 'first moment', by contrast, focus only on how well the college grades are matched; checking how much influence each variable has requires additional model surgery, and sometimes must be given up entirely. Even the 'second moment'-based statistical verification of the calculation's reliability is impossible. Following the Student-t verification learned in high school, one can see that the 0.412 and 0.15 figures in the model above are both statistically reliable, but no comparable verification is possible for machine-learning-series calculations.
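
The 'second moment' check mentioned here is nothing exotic; it is simply the coefficient divided by its standard error. A few lines suffice, using the numbers read off the equation above:

```python
# Coefficients and standard errors read off the UGPA regression above.
coef = {"HGPA": (0.412, 0.094), "SAT": (0.15, 0.011), "SK": (-0.083, 0.026)}

for name, (b, se) in coef.items():
    t = b / se
    # |t| greater than roughly 2 rejects 'coefficient = 0' at the 5% level.
    print(f"{name}: t = {t:.2f}, significant = {abs(t) > 2}")
```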

Why the expression ‘interpretable artificial intelligence’ appears

You may have noticed the expression 'interpretable artificial intelligence' appearing frequently in the media, in bookstores, and elsewhere. The expression exists because machine learning models have the blind spot of conveying only the 'first moment' value, which makes interpretation impossible. As the example above shows, they cannot give answers at the level of existing statistical methodology to questions such as how deep the relationship between variables runs, whether the value of that relationship can be trusted, and whether it would appear similarly in new data.

Going back to the data study group, supported by a large company, that created a website under a title like 'How much Boston house price data have you used?': if even one person among them had known that machine-learning-series models carry the problems above, could they have confidently declared on social media that they had tried several models and found 'deep learning' to be the best, and emailed me claiming to be experts because they could run the code that far?

As we all know, real estate prices are strongly influenced by government policy, the surrounding educational environment, and transportation access. This is not only true in Korea; based on my experience living abroad, major overseas cities are not much different. To be specific, in Korea the brand of the apartment complex seems to be an even more influential variable.

The size of the house, the number of rooms, and the like are meaningful only when other conditions are equal, and other important variables include the orientation of the windows (south, southeast, southwest) and the floor-plan type. In the Boston house price data circulating on the Internet at the time, all such core variables were missing; it was simply example data for checking whether code runs.

Wouldn't 99% or 100% accuracy be possible with artificial intelligence?

$$\widehat{\log(rent)} = \underset{(.844)}{.043} + \underset{(.039)}{.066}\, \log(pop) + \underset{(.039)}{.507}\, \log(avginc) + \underset{(.0017)}{.0056}\, pctstu$$

$$n = 64, \quad R^2 = .458$$

Another expression I often heard was, 'Even if statistics cannot raise the accuracy, couldn't artificial intelligence get it to 99% or 100%?' Presumably the 'artificial intelligence' the questioner had in mind was what is commonly known as 'deep learning' or the same family of 'neural network' models.

First of all, the explanatory power of the simple regression above is 45.8%; you can check this from the R-squared value of .458. The question, then, is whether some other 'complex' and 'artificially intelligent' model could push this to 99% or 100%. The data measures how the change in monthly rent near a school relates to population change, change in income per household, and change in the share of students. Knowing, as explained above, that real estate prices are affected by countless variables including government policy, education, and transportation, the only surefire way to fit the model with 100% accuracy is to match monthly rent with monthly rent itself. And isn't finding X by inserting X something anyone can do?
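
A throwaway sketch of that point (the numbers are simulated): regress any series on itself and the R-squared is trivially 1.0, which is exactly why a blanket demand for 100% accuracy is empty.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
rent = rng.lognormal(mean=6.5, sigma=0.3, size=64)  # fake monthly rents

# 'Predicting' monthly rent from monthly rent itself.
perfect = sm.OLS(rent, sm.add_constant(rent)).fit()
print(perfect.rsquared)  # 1.0, by construction
```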

Beyond that, I think no further explanation is needed; it is common sense that the numerous variables affecting monthly rent cannot be matched perfectly in any simple way. The domain where 99% or 100% accuracy can even be attempted is not social science data, but data that repeatedly produces standardized results in the laboratory, or, in the terms used above, 'low-variance data'. Typical examples are language data that must follow a grammar, image data with the bizarre pictures excluded, and games like Go that are played by strategies within fixed rules. Although it is obviously impossible to match 99% or 100% of the high-variance data we encounter in daily life, at one time the baseline requirements for every artificial intelligence project commissioned by the government were 'must use deep learning' and 'must reach 100% accuracy'.

Returning to the equation above, the overall population growth rate does not have a significant impact on the rate of increase in monthly rent, while income growth has a very large impact of about 50%. When the population growth rate is checked against the Student-t distribution learned in high school, the statistic is only about 1.7, so the hypothesis that it is no different from 0 cannot be rejected: it is a statistically insignificant variable. The student population share, by contrast, is statistically different from 0 and can be judged significant, but its actual effect on the rent growth rate is tiny, at 0.56%.
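
For reference, the t-statistics behind that reading follow directly from the coefficients and the standard errors in parentheses above:

$$t_{\log(pop)} = \frac{.066}{.039} \approx 1.7 < 1.96, \qquad t_{pctstu} = \frac{.0056}{.0017} \approx 3.3 > 1.96$$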

Such an interpretation is, in principle, impossible with the 'artificial intelligence' calculations known as 'deep learning', and producing a comparable analysis there requires enormous computational cost and advanced data science methods. Nor does paying that cost mean the 45.8% explanatory power can be raised much: since the data has already been converted to logarithms and focuses on rates of change, the non-linear relationship in the data is already internalized in the simple regression model.

Due to a misunderstanding of the model known as 'deep learning', industry has made the embarrassing mistake of paying very high learning costs and pouring manpower and resources into the wrong research. Based on the simple regression example above, we hope readers will recognize the limitations of the calculation method known as 'artificial intelligence' and not repeat the mistakes researchers have made over the past six years.
