Spell check feature #3688
Comments
Hi @strarsis, thanks for opening this issue! I am wondering if you could expand a bit on the use case. OpenRefine is primarily designed for batch transformations, not manual editing of cells, so it is not clear what sorts of workflows you are thinking of here. If this is about adding spell-checking to the cell editing dialog, that could be done with the browser's native spell-checker, without us needing to ship anything special. I would also be interested to understand better how this would relate to clustering.
@wetneb: When there is a large list of data, e.g. products which all have lengthy description texts, an automatic spell check would be great. This is also related to clustering: since words should be spelled consistently, a way to merge similar words into their proper spelling, much like the clustering of facet values, would be very useful.
This is still very vague to me. Can you describe in more detail how you would use this feature?
For example: The product description may contain domain-specific words, e.g. "Jacquard" (a type of fabric). There are different misspelled variants of it, like "Jaquard", "jaquard", "jacquard" or even "jquard". The similarity matching of the clustering feature would be very helpful here. It could find all these occurrences within the full text itself (using the words as facets, though this may not be enough, as words could also have spaces missing, e.g. "wordJaquard"), allowing a fast and efficient correction of all these spelling mistakes.
OK, I think I can roughly see your request. I would rephrase it like this: implement a variant of the clustering feature which works at the word (or phrase) level instead of on whole cell values. If we go for word level, that would roughly amount to splitting cell values by whitespace, clustering the resulting words, and then joining them back together. Would that workaround work for you?
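The split / cluster / join idea above can be sketched roughly as follows. This is only an illustration in Python, not OpenRefine code: the function names `cluster_words` and `apply_clusters`, the greedy grouping strategy, and the 0.8 similarity threshold are all assumptions made for the example, and the similarity measure is Python's `difflib.SequenceMatcher` rather than any of OpenRefine's own clustering methods.

```python
from collections import Counter
from difflib import SequenceMatcher


def cluster_words(cells, threshold=0.8):
    """Greedily group similar words found across all cells (hypothetical helper)."""
    # Split every cell on whitespace and count each distinct word.
    counts = Counter(word for cell in cells for word in cell.split())
    clusters = []  # each cluster is a list of variants; index 0 is the representative
    # Visit frequent words first so that they become the representatives.
    for word, _ in counts.most_common():
        for cluster in clusters:
            rep = cluster[0]
            if SequenceMatcher(None, word.lower(), rep.lower()).ratio() >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    # Map every variant to the representative of its cluster.
    return {variant: cluster[0] for cluster in clusters for variant in cluster}


def apply_clusters(cells, mapping):
    """Replace each word by its representative and join the words back together."""
    return [" ".join(mapping.get(w, w) for w in cell.split()) for cell in cells]


cells = ["Jacquard fabric", "Jacquard cloth", "Jaquard fabric", "jquard cloth"]
mapping = cluster_words(cells)
fixed = apply_clusters(cells, mapping)
```

In this sketch the most frequent spelling wins, which is not necessarily the correct one, and punctuation attached to a word is treated as part of that word. A real implementation would presumably let the user review each cluster before merging, as the existing clustering dialog does.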
@wetneb: Hm, this could be a makeshift solution, though grammar, punctuation (commas, periods) and sentence structure wouldn't be taken into account.
For me "spellcheck" and "word clustering" seem distinct tasks. Spellcheck suggests checking words against some existing dictionary, identifying words that do not exist in the dictionary and then either correcting the word, ignoring the word or adding the word to the dictionary so future occurrences of the word are accepted as correct. Spell checking is obviously language dependent - you need dictionaries for different languages. Word clustering seems to be about asking "for this body of text, are there words that are similar enough that the same original word might have been meant?" Spellcheck seems like it might be doable, but I feel like it's on the edge of the intended functionality of OpenRefine. My first thought is that an Extension might be the best approach to adding this type of functionality. Word clustering seems like it could be very hard. Generally, the shorter the strings you are trying to cluster, the more likely it is you'll see false positives in clustering, and as we increase the number of strings to be compared, the number of comparisons that need to be made for nearest neighbour clustering increases at a higher rate - which makes it a difficult problem from a technical perspective. Finally, based on the example given by @strarsis above
It seems that there could be an overlap with Named Entity Recognition rather than spelling - perhaps part of the question here is whether the aim is to identify the occurrence of an entity, or just correct some spelling? I'm still a bit unclear on whether the point here is about "correcting" text or if it's about finding the occurrence of particular words in bodies of text where we know that the words might be mis-spelt or otherwise garbled. "Correcting" spelling is a different problem to "finding words which might be X", I think. For me, an example data set, with an example of how you'd hope to manipulate it in OpenRefine and what the desired output would be, would be really helpful in understanding what's required here.
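To put a rough number on the scaling concern above: if every distinct word were compared with every other one (a naive all-pairs approach, which is an assumption here and not a description of how OpenRefine's nearest neighbour clustering is actually implemented), the number of comparisons grows quadratically with the number of words. The helper name `pairwise_comparisons` is made up for this illustration.

```python
def pairwise_comparisons(n):
    """Number of unordered pairs among n distinct strings: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (100, 1_000, 10_000):
    # Ten times more strings means roughly a hundred times more comparisons.
    print(n, pairwise_comparisons(n))
# prints:
# 100 4950
# 1000 499500
# 10000 49995000
```

Splitting long description texts into words can easily produce many more distinct strings than a typical column of whole cell values, which is where this growth starts to matter.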
I think an extension, similar to the Named Entity Recognition extension, is the right place for this. It's definitely natural language specific. The functionality still needs more specification though. Some possible modes of operation for a GREL function:
Users would probably want control over a user dictionary in addition to the system supplied dictionary. If this isn't a GREL function, but an interactive UI facility, it could be a lot more graphical and interactive with the user, but it would also be pretty far outside the current interaction model.
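As a rough illustration of how a system dictionary and a user dictionary might be combined, here is a minimal Python sketch. It is not GREL and not an existing OpenRefine API: the function name `spellcheck`, its parameters, the punctuation handling and the 0.8 cutoff are all assumptions for the example, and suggestions come from Python's `difflib.get_close_matches`.

```python
from difflib import get_close_matches


def spellcheck(text, system_dictionary, user_dictionary=frozenset()):
    """Return {unknown word: [suggestions]} for one cell value (hypothetical helper)."""
    # Words are accepted if they appear in either dictionary (case-insensitive).
    known = {w.lower() for w in system_dictionary} | {w.lower() for w in user_dictionary}
    candidates = sorted(known)
    report = {}
    for word in text.split():
        # Strip surrounding punctuation so "soft." is checked as "soft".
        token = word.strip(".,;:!?\"'()").lower()
        if token and token not in known:
            report[token] = get_close_matches(token, candidates, n=3, cutoff=0.8)
    return report


system_words = {"the", "fabric", "is", "soft"}
user_words = {"Jacquard"}  # domain-specific term added by the user

with_user = spellcheck("The jaquard fabric is soft.", system_words, user_words)
without_user = spellcheck("The jaquard fabric is soft.", system_words)
```

With the user dictionary, the misspelling "jaquard" gets "jacquard" as a suggestion; without it, the word is still flagged but there is nothing to suggest. A real dictionary would of course be language specific and far larger than these toy sets.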
That feature would be very useful! If this is too tangential to the intended feature set of OpenRefine, are there tools that implement this kind of functionality?
"That feature" being what? We've discussed a number of different things and none of them are very well specified.
There is a Named Entity Recognition extension for OpenRefine, which I'm pretty sure is listed on our website, although I don't know if it's currently being maintained. You can also do this using the "Fetch URL" operation to ping a RESTful API.
@tfmorris: Apparently it isn't maintained anymore (archived), but it could still work. And now I also know what this feature is called.
GitHub is a distributed ecosystem, so even if the original project is no longer being maintained, one can often find another fork which has activity. You can visualize the network of forks here: https://github.com/RubenVerborgh/Refine-NER-Extension/network It looks like this fork received updates as recently as January: https://github.com/stkenny/Refine-NER-Extension
@tfmorris: Thanks! Does the detection require a language-specific dictionary/model?
Named Entity Recognition is both natural language specific and, to some extent, domain specific. That OpenRefine NER extension uses the Stanford CoreNLP software, which has models for these languages: https://stanfordnlp.github.io/CoreNLP/human-languages.html
@tfmorris: This is awesome! Thanks!
If that addresses the problem, perhaps we could close this issue?
@wetneb: Yes!
We've had another request for spell check, so reopening to host the discussion.
@grahamjevon: I found a quite primitive but efficient approach:
Proposed solution
An integrated spell-checker (open source/free if possible) for cells.
Alternatives considered
Exporting all the data and then using a different, external program.
Additional context
This is similar to Clustering, but for full-text fields.