Commit 6f582fd7 authored by Tiphaine Viard

Fixed bug in preprocessing (logical or)

parent b3471433
@@ -27,8 +27,8 @@ for i, filename in enumerate(tqdm(glob.glob('txts/*.txt'))):
     tokens = word_tokenize(lines)
     # Remove tokens with length < 3, not a link and not in stop words
     tokens = (' ').join([t.lower() for t in tokens
-                         if len(t) > 3
-                         and t.isalpha()
+                         if len(t) >= 3
+                         and (t.isalpha() or t in "!\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~")
                          and t.lower() not in stop_words
                          and not "http" in t.lower()
                          ])
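
To make the effect of the updated condition easier to check, here is a minimal standalone sketch of the same four-part filter. It is not the repository's code: `keep_token`, `clean`, and the tiny `STOP_WORDS` set are hypothetical names introduced for illustration, and plain token lists stand in for the output of `word_tokenize`. `string.punctuation` holds the same characters as the literal in the diff.

```python
import string

# Hypothetical stand-in for the script's stop-word list (assumption).
STOP_WORDS = {"the", "and", "for", "with"}


def keep_token(token, stop_words=STOP_WORDS):
    """Return True if a token passes the four conditions of the updated filter."""
    lowered = token.lower()
    return (
        len(token) >= 3  # was `> 3` before the commit
        # `in` on a string is a substring test, mirroring the diff as written
        and (token.isalpha() or token in string.punctuation)
        and lowered not in stop_words
        and "http" not in lowered
    )


def clean(tokens, stop_words=STOP_WORDS):
    """Lower-case the surviving tokens and join them with single spaces."""
    return " ".join(t.lower() for t in tokens if keep_token(t, stop_words))


if __name__ == "__main__":
    print(clean(["The", "Cat", "sat", "http", "abc1", "on"]))
```

Because the length check requires at least three characters and `in` tests for a substring, the punctuation branch only admits runs of three or more characters that appear consecutively in the punctuation string (for example `./:`). Single punctuation marks and tokens such as `...` are still dropped. If keeping individual punctuation marks was the intent, the length check would need to be scoped to the alphabetic branch only.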