Friday, June 21, 2024

Robotics Intl


GPT-4o’s Chinese token-training data is polluted by spam and porn websites


The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
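The article does not say which language filters Das used. As a rough illustration of the idea, the sketch below labels each token by the Unicode script of its letters and reports the share of each script in a vocabulary. The function names and the toy vocabulary are hypothetical, and script is only a proxy for language: Vietnamese and English both use Latin letters, so separating them would need a real language-identification step.

```python
import unicodedata
from collections import Counter


def dominant_script(token: str) -> str:
    """Return a rough script label (e.g. "LATIN", "CYRILLIC") for a token."""
    counts = Counter()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        # The first word of a Unicode character name is usually its script.
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    if not counts:
        return "OTHER"  # token has no letters at all
    return counts.most_common(1)[0][0]


def script_shares(vocab: list[str]) -> dict[str, float]:
    """Fraction of tokens in the vocabulary assigned to each script label."""
    totals = Counter(dominant_script(token) for token in vocab)
    return {script: count / len(vocab) for script, count in totals.items()}


# Hypothetical toy vocabulary, standing in for a real 200,000-token list.
toy_vocab = [" the", "ing", " Prime", "привет", " мир", "مرحبا", "नरेंद्र", "中文", "123"]

for script, share in sorted(script_shares(toy_vocab).items()):
    print(f"{script}: {share:.0%}")
```

Run against a real vocabulary, the same counting would give a per-script breakdown comparable to the 25% non-English figure, though the exact numbers depend on how mixed-script and letterless tokens are handled.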

“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze the prompts faster and charge users less for the same answer. With the new tokenizer, “you’re [looking at] almost four times cost reduction,” he says.
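The arithmetic behind that claim is simple if usage is billed per token: a prompt that needs a quarter as many tokens costs a quarter as much. The sketch below assumes per-token billing, and the token counts and price are hypothetical numbers chosen only to illustrate a fourfold reduction.

```python
def prompt_cost(num_tokens: int, price_per_million_tokens: float) -> float:
    """Cost of processing a prompt when billing is proportional to token count."""
    return num_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical figures: the same text under an older and a newer tokenizer.
old_tokens = 4_000   # assumed token count with the older tokenizer
new_tokens = 1_000   # assumed token count with the newer tokenizer
price = 5.0          # assumed price in dollars per million tokens

old_cost = prompt_cost(old_tokens, price)
new_cost = prompt_cost(new_tokens, price)
reduction_factor = old_cost / new_cost

print(f"old: ${old_cost:.4f}, new: ${new_cost:.4f}, reduction: {reduction_factor:.1f}x")
```

The price cancels out of the ratio, so the reduction factor depends only on how many fewer tokens the new tokenizer needs for the same text.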

Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “college,” and “worldwide” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.

That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”

Polluted data and a lack of cleaning

However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.
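The researchers' exact procedure is not described here. One plausible way to run this kind of inspection is to keep only the tokens made up entirely of Chinese characters and sort them by length. The sketch below does that over a hypothetical toy vocabulary of neutral words; the function names are illustrative, and the character test covers only the basic CJK Unified Ideographs range, which is an assumed simplification.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def longest_chinese_tokens(vocab: list[str], top_n: int = 3) -> list[str]:
    """Return the longest tokens that consist only of Chinese characters."""
    stripped = (token.strip() for token in vocab)  # drop leading/trailing spaces
    chinese = [t for t in stripped if t and all(is_cjk(c) for c in t)]
    # sorted() is stable, so tokens of equal length keep their vocabulary order.
    return sorted(chinese, key=len, reverse=True)[:top_n]


# Hypothetical toy vocabulary with neutral words, not real GPT-4o tokens.
toy_vocab = ["的", "中国", "北京大学", "hello", " 今天天气", "中a", "人工智能技术"]

for token in longest_chinese_tokens(toy_vocab):
    print(len(token), token)
```

Reading the top of such a list by hand is what reveals whether the longest entries are ordinary vocabulary or, as the researchers report for GPT-4o, spam phrases.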

“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It is not rare for a language model to crawl spam when collecting training data, but usually there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.

The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.

These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or merely scams. And the language is inserted into content farm websites or sometimes legitimate websites so they can be indexed by search engines, circumvent the spam filters, and come up in random searches. For example, Google indexed one search result page on a US National Institutes of Health website, which lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.

Copyright © 2023 Robotics Intl.
Robotics Intl is not responsible for the content of external sites.
