Best Ways to Strip Punctuation From Strings in Python and JavaScript

Updated: March 27, 2026

#python #javascript #string-manipulation #recreated

Summary

Python: use str.translate() with str.maketrans() for speed, re.sub() for Unicode-aware patterns, or a set built from string.punctuation for simplicity. JavaScript: use .replace() with the regex /[^\w\s]/g for ASCII text, or Unicode property escapes such as /\p{P}/gu for Unicode-aware punctuation removal. Choose based on your performance needs and character range.

Stripping punctuation is a common task in natural language processing (NLP) preprocessing, form validation, text analysis, and data cleaning. The "best" method depends on your data (ASCII vs. Unicode), your performance requirements (speed vs. readability), and the features your language offers. This reference guide compares methods in Python and JavaScript, with relative performance rankings and edge-case handling.

Python: Methods Ranked by Speed

1. str.translate() + str.maketrans() (fastest)

import string

text = "Hello, world! How are you?"
translator = str.maketrans('', '', string.punctuation)
result = text.translate(translator)
print(result)  # "Hello world How are you"

Performance: fastest
Pros: the fastest pure-Python approach; handles all ASCII punctuation
Cons: limited to a predefined set of characters; does not remove Unicode punctuation

Edge cases:

# string.punctuation is ASCII-only, so non-ASCII punctuation such as "…" survives
text = "café…hello"
result = text.translate(translator)
print(result)  # "café…hello" (… is not removed; é is kept)

# The same applies to other Unicode punctuation:
text = "Hello «world» and ‹greetings›"  # French-style quotation marks (guillemets)
result = text.translate(translator)
print(result)  # "Hello «world» and ‹greetings›" (not removed)
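
One way to keep the speed of translate() while covering non-ASCII punctuation is to build the deletion table from Unicode categories instead of string.punctuation. The sketch below is an illustration, not part of the original method: the names UNICODE_PUNCT_TABLE and strip_unicode_punct are hypothetical, and the table is built once at import time by scanning every code point.

```python
import sys
import unicodedata

# Built once at import time: every code point in a "P*" category maps to None,
# which tells str.translate() to delete it.
UNICODE_PUNCT_TABLE = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)

def strip_unicode_punct(text: str) -> str:
    """Delete all Unicode punctuation using a precomputed translate table."""
    return text.translate(UNICODE_PUNCT_TABLE)

print(strip_unicode_punct("Hello «world» and ‹greetings›…"))  # Hello world and greetings
```

Building the table takes a noticeable fraction of a second, so this is likely to pay off only when many strings are processed in one run.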

2. Regular expressions with re.sub() (most flexible)

import re

text = "Hello, world! How are you?"
result = re.sub(r'[^\w\s]', '', text)
print(result)  # "Hello world How are you"

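Pattern breakdown:

[^\w\s] = matches anything that is not a word character (\w) or whitespace (\s)
Word characters = in Python 3 str patterns, Unicode letters and digits plus the underscore (_); only a-z, A-Z, 0-9, and _ when re.ASCII is set

Performance: medium (slower than translate(), but more control)
Pros: Unicode-aware, customizable patterns, removes punctuation and symbols in any script
Cons: slower than translate(); keeps the underscore because it counts as a word character; overkill for simple ASCII text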

For Unicode punctuation:

# In Python 3, str patterns are Unicode-aware by default, so re.UNICODE is not needed
text = "Hello, café! «World»"
result = re.sub(r'[^\w\s]', '', text)
print(result)  # "Hello café World"

# Remove ASCII punctuation only, keeping every non-ASCII character:
result = re.sub(r'[^\w\s\u0080-\uFFFF]', '', text)
print(result)  # "Hello café «World»" (é and the guillemets are both kept)
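
A subtle point about \w is that it matches precomposed accented letters but not combining marks, so the result can depend on how the text is normalized. The following sketch, using only the standard re and unicodedata modules, shows the difference; PUNCT_RE is a name chosen here for illustration.

```python
import re
import unicodedata

PUNCT_RE = re.compile(r"[^\w\s]")

precomposed = unicodedata.normalize("NFC", "cafe\u0301!")  # é as one code point
decomposed = unicodedata.normalize("NFD", "caf\u00e9!")     # e + combining accent

print(PUNCT_RE.sub("", precomposed))  # café
print(PUNCT_RE.sub("", decomposed))   # cafe (the combining accent is stripped)

# With re.ASCII, \w shrinks to [a-zA-Z0-9_], so é itself is removed
print(re.sub(r"[^\w\s]", "", precomposed, flags=re.ASCII))  # caf
```

Normalizing input to NFC before stripping is one way to avoid losing accents that arrive in decomposed form.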

3. Custom punctuation set

import string

PUNCTUATION_TO_REMOVE = set(string.punctuation)

def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation."""
    return ''.join(char for char in text if char not in PUNCTUATION_TO_REMOVE)

text = "Hello, world! How are you?"
result = strip_punctuation(text)
print(result)  # "Hello world How are you"

Performance: slow
Pros: readable, customizable, deterministic
Cons: the slowest of the first three methods; handles ASCII only

Customizable:

# Remove only specific punctuation marks
REMOVE_ONLY = {',', '!', '?'}

def strip_select(text: str) -> str:
    return ''.join(char for char in text if char not in REMOVE_ONLY)

text = "Hello, world! How are you?"
result = strip_select(text)
print(result)  # "Hello world How are you"

# Punctuation outside REMOVE_ONLY is preserved:
text = "Price: $19.99—amazing!"
result = strip_select(text)
print(result)  # "Price: $19.99—amazing" (keeps :, $, . and the dash)
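
The custom-set idea can also be inverted: start from all of string.punctuation and list the characters to keep. The sketch below assumes a hypothetical helper named make_stripper, which is not part of the original article, and combines that idea with a translate() table so the per-character loop runs in C.

```python
import string

def make_stripper(keep: str = ""):
    """Return a function that removes ASCII punctuation except characters in `keep`."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    table = str.maketrans("", "", to_remove)
    return lambda text: text.translate(table)

strip_keep_money = make_stripper(keep="$.")
print(strip_keep_money("Price: $19.99, wow!"))  # Price $19.99 wow
```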

4. Full Unicode handling (unicodedata)

import unicodedata

def remove_unicode_punctuation(text: str) -> str:
    """Remove every character in a Unicode punctuation category (P*)."""
    return ''.join(
        char for char in text
        if unicodedata.category(char)[0] != 'P'
    )

text = "Hello, world! «Café»…"
result = remove_unicode_punctuation(text)
print(result)  # "Hello world Café"

Unicode categories:

Pc = connector punctuation (_)
Pd = dash punctuation (-, –, —)
Pe = close punctuation (), ], })
Pf = final quote punctuation (»)
Pi = initial quote punctuation («)
Po = other punctuation (!, ?, .)
Ps = open punctuation ((, [, {)

Pros: handles any Unicode punctuation correctly; language-independent
Cons: the slowest method; overkill for ASCII; leaves symbols such as $, +, and ~ in place because they belong to the S* categories, not P*
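
Because several characters in string.punctuation are classified by Unicode as symbols rather than punctuation, a category-based filter may need to cover S* as well to match the ASCII behaviour. The sketch below prints the category of a few characters and shows one possible filter; strip_punct_and_symbols is a hypothetical name used only for this example.

```python
import unicodedata

def strip_punct_and_symbols(text: str) -> str:
    """Drop characters whose category starts with P (punctuation) or S (symbol)."""
    return "".join(c for c in text if unicodedata.category(c)[0] not in "PS")

# Inspect how Unicode classifies some common characters
for ch in "_-)»!$+~":
    print(repr(ch), unicodedata.category(ch))

print(strip_punct_and_symbols("Total: $5 + $7 ~ «twelve»"))  # Total 5  7  twelve
```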

JavaScript: Methods Ranked by Speed

1. Regular expressions with replace() (standard)

const text = "Hello, world! How are you?";
const result = text.replace(/[^\w\s]/g, '');
console.log(result);  // "Hello world How are you"

Pattern:

/[^\w\s]/g = removes every character that is not a word character or whitespace
g flag = global (replace all occurrences)

Performance: fast
Pros: fast, concise, handles most cases
Cons: \w is ASCII-only in JavaScript, so accented and non-Latin letters are removed along with the punctuation

For Unicode:

// \w stays ASCII-only even with the u flag, so non-ASCII letters are stripped too:
const text = "Hello, café! «World»";
const result = text.replace(/[^\w\s]/gu, '');
console.log(result);  // "Hello caf World" (é is lost)

// Remove every character that is not a letter, number, or whitespace:
const result2 = text.replace(/[^\p{Letter}\p{Number}\s]/gu, '');
console.log(result2);  // "Hello café World"

2. replace() with Unicode Property Escapes (ES2018)

// Remove all Unicode punctuation
const text = "Hello, world! «Café»";
const result = text.replace(/\p{P}/gu, '');
console.log(result);  // "Hello world Café"

Unicode properties:

\p{P} = any punctuation
\p{Punctuation} = same as \p{P}
\p{Pc} = connector punctuation
\p{Pd} = dash punctuation
\p{Po} = other punctuation

Pros: precise; handles all Unicode punctuation
Cons: requires an ES2018+ engine and the u flag

3. Custom function (readable)

function stripPunctuation(text) {
  const punctuation = /[^\w\s]/g;
  return text.replace(punctuation, '');
}

const text = "Hello, world! How are you?";
const result = stripPunctuation(text);
console.log(result);  // "Hello world How are you"

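4. Remove only specific punctuation

function removeSpecificPunctuation(text, charsToRemove = ",.!?") {
  // Escape characters that are special inside a character class: \ ] ^ -
  const escaped = charsToRemove.replace(/[\\\]^-]/g, '\\$&');
  const pattern = new RegExp(`[${escaped}]`, 'g');
  return text.replace(pattern, '');
}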

const text = "Price: $19.99—amazing!";
const result = removeSpecificPunctuation(text, '!?');
console.log(result);  // "Price: $19.99—amazing"

Performance Comparison Table

Language	Method	Performance	Unicode support	Best for
Python	translate()	Fastest	ASCII only	Maximum speed, ASCII data
Python	re.sub()	Medium	Full Unicode	Flexibility, Unicode text
Python	Set lookup + join()	Slow	ASCII only	Readability, custom rules
Python	unicodedata	Slowest	Full Unicode	Precise Unicode handling
JavaScript	replace(/regex/)	Fast	ASCII; limited Unicode	General use
JavaScript	replace(/\p{P}/gu)	Medium	Full Unicode	ES2018+ projects

Practical Use Cases

Use case 1: NLP preprocessing (remove all punctuation)

# Python
import re

texts = ["Hello, world!", "Price: $19.99", "Hello…goodbye"]
cleaned = [re.sub(r'[^\w\s]', '', text) for text in texts]
# ["Hello world", "Price 1999", "Hellogoodbye"]

// JavaScript
const texts = ["Hello, world!", "Price: $19.99", "Hello…goodbye"];
const cleaned = texts.map(text => text.replace(/[^\w\s]/gu, ''));
// ["Hello world", "Price 1999", "Hellogoodbye"]
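
Deleting punctuation outright can fuse the words on either side of it, which is often undesirable before tokenization. A possible alternative, sketched here in Python with the hypothetical helper clean_for_tokens, is to replace each run of punctuation with a space and then collapse the whitespace.

```python
import re

_PUNCT = re.compile(r"[^\w\s]+")   # one or more non-word, non-space characters
_SPACES = re.compile(r"\s+")       # any run of whitespace

def clean_for_tokens(text: str) -> str:
    """Replace punctuation with spaces, then collapse and trim whitespace."""
    spaced = _PUNCT.sub(" ", text)
    return _SPACES.sub(" ", spaced).strip()

texts = ["Hello, world!", "Price: $19.99", "Hello…goodbye"]
print([clean_for_tokens(t) for t in texts])
# ['Hello world', 'Price 19 99', 'Hello goodbye']
```

The trade-off is visible in the second item: the decimal number is split into two tokens, so numeric text may need separate handling.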

Use case 2: Keep apostrophes (contractions)

import re

def keep_apostrophes(text):
    return re.sub(r"[^\w\s']", '', text)

text = "It's don't valid, isn't it?"
result = keep_apostrophes(text)
print(result)  # "It's don't valid isn't it"

function keepApostrophes(text) {
  return text.replace(/[^\w\s']/gu, '');
}

const text = "It's don't valid, isn't it?";
const result = keepApostrophes(text);
console.log(result);  // "It's don't valid isn't it"
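
Keeping every apostrophe also keeps single quotation marks, and text copied from word processors often uses the curly apostrophe (U+2019) instead of the ASCII one. The Python sketch below is one possible refinement under those assumptions; keep_inner_apostrophes is a hypothetical name.

```python
import re

def keep_inner_apostrophes(text: str) -> str:
    """Keep apostrophes only when they sit between two word characters."""
    text = text.replace("\u2019", "'")            # normalize curly apostrophes
    text = re.sub(r"[^\w\s']", "", text)          # drop all other punctuation
    return re.sub(r"(?<!\w)'|'(?!\w)", "", text)  # drop apostrophes at word edges

print(keep_inner_apostrophes("'Hello,' she said. It\u2019s fine."))
# Hello she said It's fine
```

Note that this rule also drops trailing possessive apostrophes, as in "dogs'", which may or may not be acceptable for a given dataset.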

Use case 3: Remove punctuation but keep hyphens (for hyphenated words)

import re

def keep_hyphens(text):
    return re.sub(r"[^\w\s-]", '', text)

text = "well-known, mother-in-law: important!"
result = keep_hyphens(text)
print(result)  # "well-known mother-in-law important"

function keepHyphens(text) {
  return text.replace(/[^\w\s-]/gu, '');
}

const text = "well-known, mother-in-law: important!";
const result = keepHyphens(text);
console.log(result);  // "well-known mother-in-law important"

Use case 4: Remove punctuation except periods (to preserve sentences)

import re

def keep_periods(text):
    return re.sub(r"[^\w\s.]", '', text)

text = "Hello, world! How are you? I'm fine."
result = keep_periods(text)
print(result)  # "Hello world How are you Im fine." (the apostrophe is removed too)

Edge Cases and Pitfalls

Digits and punctuation

import re
import string

# string.punctuation does not include digits
print(string.punctuation)
# Output: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

# \w includes digits, so [^\w\s] removes the punctuation and keeps the digits
text = "Price: $19.99!"
result = re.sub(r"[^\w\s]", '', text)
print(result)  # "Price 1999"  (the decimal point is lost)

result = re.sub(r"[^\d\s]", '', text)
print(result)  # " 1999"  (only digits and whitespace remain)
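
If numeric values matter, one option is to treat a period as removable only when it is not between two digits. This is a sketch rather than a general number parser; the names _PUNCT_EXCEPT_DECIMAL and strip_keep_decimals are illustrative.

```python
import re

# Remove punctuation, but keep a "." that sits between two digits
_PUNCT_EXCEPT_DECIMAL = re.compile(r"[^\w\s.]|(?<!\d)\.|\.(?!\d)")

def strip_keep_decimals(text: str) -> str:
    """Strip punctuation while preserving decimal points inside numbers."""
    return _PUNCT_EXCEPT_DECIMAL.sub("", text)

print(strip_keep_decimals("Price: $19.99!"))    # Price 19.99
print(strip_keep_decimals("Version 2. Done."))  # Version 2 Done
```

A leading decimal point such as ".5" is still removed by this pattern, since no digit precedes it.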

Emoji and special characters

import re
import unicodedata

text = "Hello 👋 world! 🌍 How are you? 😊"

# Emoji are not word characters, so [^\w\s] removes them together with the punctuation:
result = re.sub(r"[^\w\s]", '', text)
print(result)  # "Hello  world  How are you "

# re.UNICODE changes nothing here: it is already the default for str patterns in Python 3
result = re.sub(r"[^\w\s]", '', text, flags=re.UNICODE)
print(result)  # "Hello  world  How are you "

# \p{...} classes are not supported by the standard re module (the pattern raises re.error):
# result = re.sub(r"[^\p{L}\s]", '', text)

# Use unicodedata for category-based control instead.
# Category "So" (Symbol, other) covers most emoji; every "P*" category is punctuation.
result = ''.join(
    c for c in text
    if not unicodedata.category(c).startswith('P')
    and unicodedata.category(c) != 'So'
)
print(result)  # "Hello  world  How are you "
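
Composite emoji are built from several code points joined by the zero-width joiner (U+200D, category Cf), so filtering on "So" alone can leave invisible characters behind. The sketch below illustrates this with a hypothetical helper, strip_by_category, whose parameters are assumptions made for the example.

```python
import unicodedata

def strip_by_category(text, drop_prefixes=("P",), drop_exact=("So",)):
    """Drop characters by category prefix (e.g. 'P') or exact category (e.g. 'So')."""
    return "".join(
        c for c in text
        if not unicodedata.category(c).startswith(drop_prefixes)
        and unicodedata.category(c) not in drop_exact
    )

family = "\U0001F468\u200d\U0001F469"  # man + zero-width joiner + woman

print(ascii(strip_by_category(family)))                            # '\u200d'
print(ascii(strip_by_category(family, drop_exact=("So", "Cf"))))   # ''
```

Dropping the whole Cf category also removes other format characters, such as the zero-width non-joiner used in Persian text, so it is worth checking against real data first.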

Arabic and Hebrew text (RTL scripts)

# Python's \w matches letters in any script, so Arabic words are kept
import re

text = "مرحبا, بالعالم! Hello, world!"
result = re.sub(r'[^\w\s]', '', text)
print(result)  # "مرحبا بالعالم Hello world"

// JavaScript's \w is ASCII-only even with the u flag, so use \p{L} and \p{N} instead
const text = "مرحبا, بالعالم! Hello, world!";
const result = text.replace(/[^\p{L}\p{N}\s]/gu, '');
console.log(result);  // "مرحبا بالعالم Hello world"
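
Arabic text raises two further points: it has its own punctuation marks (for example U+060C and U+061F), and its vowel marks are combining characters that \w does not match. The Python sketch below, which writes those characters as escapes to make them visible, contrasts the regex approach with a category filter that removes punctuation only.

```python
import re
import unicodedata

# U+064B = fathatan (vowel mark), U+060C = Arabic comma, U+061F = Arabic question mark
text = "مرحبا\u064b\u060c كيف حالك\u061f"

# [^\w\s] removes the Arabic punctuation, but also the vowel mark (category Mn)
regex_result = re.sub(r"[^\w\s]", "", text)

# A category filter limited to P* removes the punctuation and keeps the vowel mark
category_result = "".join(
    c for c in text if not unicodedata.category(c).startswith("P")
)

print("\u064b" in regex_result)     # False
print("\u064b" in category_result)  # True
```

Which behaviour is preferable depends on the task: search indexing often strips vowel marks on purpose, while display text usually needs them.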

Benchmarking on Your Own Data

import timeit
import string
import re

text = "Hello, world! " * 100  # 1400 chars

# Build the table, pattern, and set once so only the stripping itself is timed
TRANSLATOR = str.maketrans('', '', string.punctuation)
PATTERN = re.compile(r'[^\w\s]')
PUNCT_SET = set(string.punctuation)

# Method 1: translate
def m1():
    return text.translate(TRANSLATOR)

# Method 2: regex
def m2():
    return PATTERN.sub('', text)

# Method 3: set lookup + join
def m3():
    return ''.join(c for c in text if c not in PUNCT_SET)

print("translate():", timeit.timeit(m1, number=10000))
print("re.sub():", timeit.timeit(m2, number=10000))
print("set lookup:", timeit.timeit(m3, number=10000))
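
Before comparing timings, it is worth checking that the methods return the same output, because they do not agree on every character. The underscore is the clearest case: it is in string.punctuation, but \w treats it as a word character. A small sketch:

```python
import re
import string

TRANSLATOR = str.maketrans("", "", string.punctuation)
sample = "snake_case, ok!"

via_translate = sample.translate(TRANSLATOR)
via_regex = re.sub(r"[^\w\s]", "", sample)
via_regex_strict = re.sub(r"[^\w\s]|_", "", sample)  # also remove the underscore

print(via_translate)     # snakecase ok
print(via_regex)         # snake_case ok
print(via_regex_strict)  # snakecase ok
```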

const text = "Hello, world! ".repeat(100);  // 1400 chars

console.time("regex");
for (let i = 0; i < 10000; i++) {
  text.replace(/[^\w\s]/g, '');
}
console.timeEnd("regex");

console.time("regex unicode");
for (let i = 0; i < 10000; i++) {
  text.replace(/[^\w\s]/gu, '');
}
console.timeEnd("regex unicode");

Best Practices

Use str.translate() if:

You are working with ASCII-only text
Performance is critical (processing millions of strings)
You are using Python with the standard string.punctuation

Use re.sub() if:

You need Unicode punctuation support
You want flexible pattern matching
Moderate performance is acceptable (fewer than a million strings)

Use unicodedata if:

You need precise Unicode handling

Removing punctuation is simple in concept but subtle in practice. For ASCII-only text, Python's translate() is hard to beat. For Unicode, re.sub() offers flexibility. JavaScript's .replace() covers both through regex, provided you use \p{...} property escapes with the u flag for non-ASCII text. Choose based on your data type, performance needs, and available language features. Test on your real data: benchmark results vary widely with string length and punctuation density.
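
The guidance above can be sketched as a single Python helper that takes the fast path for ASCII input and falls back to a category filter otherwise. This is a minimal illustration under stated assumptions: strip_punctuation_auto is a hypothetical name, and the slow path removes both P* and S* categories so that it matches string.punctuation on ASCII characters.

```python
import string
import unicodedata

_ASCII_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation_auto(text: str) -> str:
    """Use translate() for ASCII text, a Unicode category filter otherwise."""
    if text.isascii():  # str.isascii() requires Python 3.7+
        return text.translate(_ASCII_TABLE)
    return "".join(c for c in text if unicodedata.category(c)[0] not in "PS")

print(strip_punctuation_auto("Hello, world!"))      # Hello world
print(strip_punctuation_auto("Hello, «café» $5!"))  # Hello café 5
```

Because the fallback drops every symbol category, it also removes most emoji, which may or may not be wanted.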