Tool does not handle special characters in filename #2

Closed
opened 2025-03-18 00:31:02 -03:00 by schalli110 · 10 comments
schalli110 commented 2025-03-18 00:31:02 -03:00 (Migrated from gitlab.com)

The tool does not work when the file name contains special characters, e.g. a comma or German umlauts.

rogs commented 2025-03-18 08:15:20 -03:00 (Migrated from gitlab.com)

Hey! Thank you for the report :)

Can you give me an example file I can use for my tests? Thanks!

schalli110 commented 2025-03-18 11:27:35 -03:00 (Migrated from gitlab.com)

Hey, sure.
Sorry, I assumed it was obvious, but if you don't have access to a German keyboard, it may not be so trivial.

Attached, please find 3 files with umlauts and other special characters in their name.
The SRT files are empty; the error still reproduces.
The zip file also contains the log of me running Get-ChildItem in PowerShell and piping the names to subscleaner.

By the way, I was mistaken, looks like commas in filenames work.

Thanks,
schalli110

repro.zip

rogs commented 2025-03-18 12:56:53 -03:00 (Migrated from gitlab.com)

That is perfect, thank you very much! I'll work on this and open a PR once it's ready 🫡

rogs commented 2025-03-19 15:33:05 -03:00 (Migrated from gitlab.com)

assigned to @rogs

rogs commented 2025-03-19 15:55:41 -03:00 (Migrated from gitlab.com)

mentioned in merge request !2

rogs commented 2025-03-19 15:56:37 -03:00 (Migrated from gitlab.com)

mentioned in commit cf619272d3

rogs (Migrated from gitlab.com) closed this issue 2025-03-19 15:56:37 -03:00
rogs commented 2025-03-19 15:57:53 -03:00 (Migrated from gitlab.com)

@schalli110 Looking a little deeper into this issue and your log, it looks like this is a problem with Windows, since I can't reproduce it on Linux or macOS.

I have pushed a possible fix, but please let me know if this doesn't fix it!

schalli110 commented 2025-03-20 00:03:17 -03:00 (Migrated from gitlab.com)

Hey Roger,

you're right, this appears to be some sort of Windows or Python stupidity.
In cmd.exe, I can change the code page to UTF-8 and start subscleaner with `python -X utf8 subscleaner.py`; then it passes umlauts correctly.
Unfortunately, PowerShell always seems to add a BOM to a UTF-8 string, even when I tell it not to, and Python does not like that.

A possible workaround would be to pass the filename as a command-line parameter instead of through a pipe.
That seems to work without code page shenanigans in both cmd.exe and PowerShell, e.g.:
Get-ChildItem .\Tür.srt | % { python subscleaner.py $_.FullName }
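
On the Python side, accepting file names as arguments could look roughly like the sketch below. This is an illustration only: `collect_filenames` is a hypothetical helper, not taken from subscleaner's actual code. It prefers names given as arguments and reads piped lines only when no arguments were passed, so the pipe encoding never comes into play in the argument case.

```python
from typing import Iterable, List


def collect_filenames(args: List[str], stdin_lines: Iterable[str]) -> List[str]:
    """Return file names from arguments if any were given, else from piped lines.

    Hypothetical helper for illustration; not subscleaner's real interface.
    """
    if args:
        # Arguments reach Python as str objects, so no pipe decoding is needed.
        return list(args)
    # Fall back to one file name per piped line, skipping blank lines.
    return [line.strip() for line in stdin_lines if line.strip()]


# Both input routes should yield the same list of names.
from_args = collect_filenames(["Tür.srt"], ["ignored.srt\n"])
from_pipe = collect_filenames([], ["Tür.srt\n", "\n"])
print(from_args == from_pipe)  # True
```

In a real script the two parameters would presumably be `sys.argv[1:]` and `sys.stdin`.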

schalli110 commented 2025-03-20 00:18:46 -03:00 (Migrated from gitlab.com)

I dug around a bit more and found this:
https://bugs.python.org/issue21927

The workaround is to teach PowerShell not to write the BOM:

$OutputEncoding = New-Object System.Text.UTF8Encoding($False)

Then subscleaner works as expected:

Get-ChildItem .\Tür.srt | % { $_.Name } | python -X utf8 .\subscleaner.py

returns

Starting script

Analyzing: Tür.srt

Done
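
A complementary option would be to make the Python side tolerant of a BOM, so that the PowerShell setting is not strictly required. The sketch below assumes the script can read the piped data as raw bytes (for example via `sys.stdin.buffer`); `decode_piped_name` is a hypothetical name and not part of subscleaner. It relies on Python's standard `utf-8-sig` codec, which decodes UTF-8 and drops one leading BOM if there is one.

```python
import codecs


def decode_piped_name(raw: bytes) -> str:
    """Decode one piped line as UTF-8, dropping a leading BOM if present.

    Hypothetical helper for illustration; not subscleaner's real code.
    """
    # "utf-8-sig" removes a single leading BOM (bytes EF BB BF) while decoding.
    text = raw.decode("utf-8-sig")
    # PowerShell ends lines with CRLF, other shells with LF; strip either.
    return text.rstrip("\r\n")


# The same file name, once with a BOM and CRLF, once without a BOM.
with_bom = codecs.BOM_UTF8 + "Tür.srt\r\n".encode("utf-8")
without_bom = "Tür.srt\n".encode("utf-8")
print(decode_piped_name(with_bom) == decode_piped_name(without_bom))  # True
```

Only the very first line of a pipe would normally carry the BOM, so applying this to every line should be harmless, though that has not been tested against the tool itself.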

rogs commented 2025-03-20 08:39:26 -03:00 (Migrated from gitlab.com)

That's perfect. Thank you for debugging!

Reference: rogs/subscleaner#2