Spell check feature #3688

strarsis · 2021-03-02T16:46:49Z

Proposed solution

An integrated spell-checker (OpenSource/free if possible) for cells.

Alternatives considered

Exporting all the data and then using a different, external program.

Additional context

This is similar to Clustering but rather for full text fields.

wetneb · 2021-03-03T12:54:32Z

Hi @strarsis, thanks for opening this issue!

I am wondering if you could expand a bit on the use case. OpenRefine is primarily designed for batch transformations, not manual editing of cells, so it is not clear what sorts of workflows you have in mind here. If this is about adding spell-checking to the cell editing dialog, that could be done with the browser's native spell-checker, without us needing to ship anything special.

Also I would be interested to understand better how this would relate to clustering.

strarsis · 2021-03-03T15:38:00Z

@wetneb: So when there is a large list of data, e.g. products that all have lengthy description texts, automatic spell check would be great. This is also somewhat related to clustering: as words should be spelled consistently, a means to merge similar words into their proper spelling, very similar to clustering of facets, would be very useful.

wetneb · 2021-03-03T15:41:48Z

> So when there is a large list of data, e.g. products that all have lengthy description texts, automatic spell check would be great

This is still very vague to me. Can you describe in more detail how you would use this feature?

strarsis · 2021-03-03T15:51:41Z

For example: The product description may contain domain-specific words, e.g. "Jacquard" (fabric/cloth type). And there are different variants and misspellings, like "Jaquard", "jaquard", "jacquard" or even "jquard". The similarity matching of the clustering feature would be very helpful here. It could find all these occurrences in the full text itself (using the words as facets, though this may not be enough, as words could also have spaces missing, e.g. "wordJaquard"), allowing a fast and efficient correction of all these spelling mistakes.

wetneb · 2021-03-03T19:40:50Z

Ok, I think I can roughly see your request. I would rephrase it like this: implement a variant of the clustering feature which works at the word (or phrase) level instead of on whole cell values. If we go for word level, that would roughly amount to splitting cell values by whitespace, clustering the resulting words, and then joining them back together. Would that workaround work for you?
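The split, cluster and join steps described above can be sketched outside OpenRefine as follows. This is only an illustration under stated assumptions: `word_key`, `cluster_words` and `merge_words` are hypothetical names, and the key used here (plain lowercasing) is a deliberately simplified stand-in for a key-collision method, not OpenRefine's actual clustering code.

```python
from collections import Counter, defaultdict


def word_key(word):
    """Hypothetical, simplified clustering key: the lowercased word."""
    return word.lower()


def cluster_words(cells):
    """Split every cell on whitespace and group the words by their key."""
    clusters = defaultdict(Counter)
    for cell in cells:
        for word in cell.split():
            clusters[word_key(word)][word] += 1
    return clusters


def merge_words(cells):
    """Replace each word by the most frequent variant in its cluster,
    then join the words of each cell back together with single spaces."""
    clusters = cluster_words(cells)
    canonical = {key: variants.most_common(1)[0][0]
                 for key, variants in clusters.items()}
    return [" ".join(canonical[word_key(w)] for w in cell.split())
            for cell in cells]


cells = ["Jacquard scarf", "jacquard tie", "Jacquard shawl"]
merged = merge_words(cells)
```

With this key only case variants collapse into one cluster. A misspelling such as "Jaquard" gets its own key and would need a distance-based method instead. Punctuation attached to a word stays part of the token, and runs of whitespace are collapsed when the cell is joined back.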

strarsis · 2021-03-03T20:01:28Z

@wetneb: Hm, this could be a makeshift solution, though grammar, punctuation (commas, periods) and sentence structure wouldn't be taken into account.

ostephens · 2021-03-05T11:48:18Z

For me "spellcheck" and "word clustering" seem distinct tasks.

Spellcheck suggests checking words against some existing dictionary, identifying words that do not exist in the dictionary and then either correcting the word, ignoring the word or adding the word to the dictionary so future occurrences of the word are accepted as correct. Spell checking is obviously language dependent - you need dictionaries for different languages.

Word clustering seems to be about saying "for this body of text, are there words that are similar enough that the same original word might have been meant"?

Spellcheck seems like it might be do-able but I feel like it's on the edge of the intended functionality of OpenRefine. My first thought is that an Extension might be the best approach to adding this type of functionality.

Word clustering seems like it could be very hard. Generally, the shorter the strings you are trying to cluster, the more likely it is you'll see false positives in clustering, and as we increase the number of strings to be compared, the number of comparisons that need to be made for nearest neighbour clustering increases at a higher rate, which makes it a difficult problem from a technical perspective.
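Both effects can be illustrated with a small sketch. It assumes a naive all-pairs comparison using Levenshtein edit distance; `levenshtein`, `close_pairs` and `pair_count` are hypothetical helper names, and real nearest-neighbour implementations typically use blocking to avoid comparing every pair.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def close_pairs(words, max_distance=1):
    """All pairs of words that are at most max_distance edits apart."""
    return [(a, b) for a, b in combinations(words, 2)
            if levenshtein(a, b) <= max_distance]


def pair_count(n):
    """Comparisons made by a naive all-pairs approach over n strings."""
    return n * (n - 1) // 2


# Short, unrelated words easily fall within one edit of each other.
short_pairs = close_pairs(["cat", "cot", "cut", "car"])
```

Here four distinct three-letter words already yield four "similar" pairs at distance 1, while "Jaquard" and "Jacquard" are also at distance 1. Under the all-pairs assumption, 1,000 distinct words need 499,500 comparisons and 10,000 words need 49,995,000.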

Finally, based on the example given by @strarsis above

> The product description may contain domain-specific words, e.g. "Jacquard" (fabric/cloth type)

It seems that there could be an overlap with Named Entity Recognition rather than spelling - perhaps part of the question here is whether the aim is to identify the occurrence of an entity, or just correct some spelling? I'm still a bit unclear on whether the point here is about "correcting" text or if it's about finding the occurrence of particular words in bodies of text where we know that the words might be mis-spelt or otherwise garbled. "Correcting" spelling is a different problem to "finding words which might be X" I think.

For me, an example data set, along with an example of how you'd hope to manipulate it in OpenRefine and what the desired output would be, would be really helpful in understanding what's required here.

tfmorris · 2021-03-05T18:43:59Z

I think an extension, similar to the Named Entity Recognition extension, is the right place for this. It's definitely natural language specific.

The functionality still needs more specification though. Some possible modes of operation for a GREL function:

return a list of words that are (potentially) misspelled in the input text
return the input text with all misspellings corrected (above a certain user specified confidence threshold?)
return a list of alternative suggestions for each misspelling in the input text (this would require some type of structured output)
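To make the three modes concrete, here is a minimal Python sketch rather than GREL. Everything in it is an assumption for illustration: the function names, the tiny lowercase `dictionary`, and the use of `difflib.get_close_matches` with a similarity cutoff standing in for the "confidence threshold". A real spell-checker would use a proper dictionary and ranking.

```python
import re
from difflib import get_close_matches

WORD = re.compile(r"[A-Za-z]+")


def misspelled(text, dictionary):
    """Mode 1: words of the input that are not in the dictionary."""
    return [w for w in WORD.findall(text) if w.lower() not in dictionary]


def suggestions(text, dictionary, cutoff=0.8):
    """Mode 3: structured output mapping each unknown word to candidates."""
    known = sorted(dictionary)  # sorted for a deterministic candidate order
    return {w: get_close_matches(w.lower(), known, n=3, cutoff=cutoff)
            for w in misspelled(text, dictionary)}


def corrected(text, dictionary, cutoff=0.8):
    """Mode 2: replace unknown words by the best candidate above the cutoff.
    Candidates come back lowercase, so the original case is not preserved."""
    known = sorted(dictionary)

    def fix(match):
        word = match.group(0)
        if word.lower() in dictionary:
            return word
        candidates = get_close_matches(word.lower(), known, n=1, cutoff=cutoff)
        return candidates[0] if candidates else word

    return WORD.sub(fix, text)


dictionary = {"a", "jacquard", "scarf", "silk", "woven"}
text = "A woven jaquard scarf"
```

Calling `misspelled(text, dictionary)` gives the flat list, `suggestions(...)` gives the structured form, and `corrected(...)` returns the rewritten text. Words with no candidate above the cutoff are left untouched.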

Users would probably want control over a user dictionary in addition to the system supplied dictionary.

If this isn't a GREL function, but an interactive UI facility, it could be a lot more graphical and interactive with the user, but it would also be pretty far outside the current interaction model.

strarsis · 2021-03-05T19:35:16Z

That feature would be very useful! If this is too tangential to the intended featureset of OpenRefine, are there tools that implement this kind of functionality?

tfmorris · 2021-03-05T22:40:57Z

> That feature would be very useful!

"That feature" being what? We've discussed a number of different things and none of them are very well specified.

> If this is too tangential to the intended featureset of OpenRefine, are there tools that implement this kind of functionality?

There is a Named Entity Recognition extension for OpenRefine, which I'm pretty sure is listed on our website, although I don't know if it's currently being maintained. You can also do this using the "Fetch URL" operation to ping a RESTful API.

strarsis · 2021-03-06T14:07:24Z

@tfmorris: Apparently it isn't maintained anymore (archived). But it could still work. And now I also know what this feature is called.

tfmorris · 2021-03-06T16:48:11Z

GitHub is a distributed ecosystem, so even if the original project is no longer being maintained, one can often find another fork which has activity. You can visualize the network of forks here: https://github.com/RubenVerborgh/Refine-NER-Extension/network
It looks like this fork received updates as recently as January: https://github.com/stkenny/Refine-NER-Extension

strarsis · 2021-03-06T19:15:08Z

@tfmorris: Thanks! Does the detection require a language-specific dictionary/model?

tfmorris · 2021-03-06T19:49:23Z

Named Entity Recognition is both natural language specific and, to some extent, domain specific. That OpenRefine NER extension uses the Stanford CoreNLP software which has models for these languages: https://stanfordnlp.github.io/CoreNLP/human-languages.html

strarsis · 2021-03-06T19:58:33Z

@tfmorris: This is awesome! Thanks!

wetneb · 2021-03-07T08:55:02Z

If that addresses the problem perhaps we could close this issue?

strarsis · 2021-03-07T15:08:04Z

@wetneb: Yes!

tfmorris · 2021-03-09T18:19:36Z

We've had another request for spell check, so reopening to host the discussion.

grahamjevon · 2021-04-01T15:16:47Z

I've had a similar request on my mind.

Problem

The problem I have is that I have columns that contain long strings of text (e.g. verbose catalogue descriptions in the form of multiple sentences). This data is created in Excel (often by people for whom English is a second language) and often contains spelling errors because Excel does not have a spell check function like Word.

Solution

A spell check and edit function which opens a new window similar to the cluster and edit function. Instead of clusters of similar cells, the user will be presented with a list of unknown words (i.e. a list of all words in the selected column which are not found in a dictionary).

The user will then have the option to tick the words they want to change and to enter the replacement term into an empty field. When the user clicks "Confirm", OpenRefine will essentially apply the replace function to all of the unknown words where the checkbox has been selected.
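The workflow above boils down to two steps: collect the unknown words of a column, then apply the replacements the user ticked. A rough sketch, with hypothetical function names and a toy lowercase `dictionary` set standing in for a real word list:

```python
import re
from collections import Counter

WORD = re.compile(r"\b\w+\b")


def unknown_words(column, dictionary):
    """Count every word in the column that the dictionary does not contain."""
    counts = Counter()
    for cell in column:
        for word in WORD.findall(cell):
            if word.lower() not in dictionary:
                counts[word] += 1
    return counts


def apply_replacements(column, replacements):
    """Replace whole-word, case-sensitive occurrences of the selected words."""
    if not replacements:
        return list(column)
    alternatives = "|".join(re.escape(word) for word in replacements)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return [pattern.sub(lambda m: replacements[m.group(1)], cell)
            for cell in column]


dictionary = {"administration", "files", "of", "the", "office"}
column = ["Adminstration files",
          "Files of the adminstration",
          "Administraton office"]

# What the dialog would list, and what the user ticked and typed in.
listed = unknown_words(column, dictionary)
ticked = {"Adminstration": "Administration",
          "Administraton": "Administration"}
cleaned = apply_replacements(column, ticked)
```

Because matching is case-sensitive in this sketch, the lowercase "adminstration" is listed as a separate entry and stays unchanged unless it is ticked as well.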

Current alternatives

I could do something similar outside of OpenRefine using pyspellchecker. But it would be easier to have an extra cleaning function in OpenRefine while I'm cleaning the data in other ways. And this would be more accessible to people not familiar with Python.

Possible configuration features

Just as the cluster function uses different clustering methods, perhaps the spell check function could allow the user to select different dictionaries. Not least because different users will use different languages.
It would probably also be useful for different word boundary methods to be available for selection. e.g. \b\w*\b or value.split(' '). I'm sure there are others that would be even more useful. I imagine different users would want to define words in different ways.
Perhaps it might be useful for users to be able to establish their own dictionary or list of stop words.
It might be useful to include a filter within the spell check window (e.g. if the column contains English and Arabic descriptions, it might be helpful if the user could filter out any words that don't contain Latin characters).
I'm not sure if I would want the spell checker to suggest a replacement word, but maybe some use cases would benefit from this.
Case sensitivity is another consideration. To reduce the size of the list it might be useful to have words with the same spellings, but different cases, grouped as one word. But the replace feature would probably want to maintain the same case pattern. It might be a bit premature to think about this in detail, but it might be useful to have some degree of flexibility over case sensitivity for different use cases.
It would probably be useful to apply this to more than one column at a time (to reduce duplication of effort if there is more than one column with text requiring spell checking).
I wonder whether it would also be useful to have the option to cluster similar unknown words, but maybe that is making things too complex. An example of where this would be useful would be in a column where the same word has been misspelled in different ways (e.g. "Administration" is misspelled as "Adminstration" and "Administraton").
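Two of the configuration points above, word boundary methods and case handling, can be illustrated with a short sketch. The helper names are hypothetical, and the regex uses `\w+` rather than `\w*` so that empty matches are not returned.

```python
import re
from collections import defaultdict


def words_by_regex(text):
    """Word boundary method 1: runs of word characters."""
    return re.findall(r"\b\w+\b", text)


def words_by_space(text):
    """Word boundary method 2: split on single spaces, dropping empties."""
    return [token for token in text.split(" ") if token]


def group_by_case(words):
    """Group spellings that differ only in case under one lowercase key."""
    groups = defaultdict(set)
    for word in words:
        groups[word.lower()].add(word)
    return groups


def match_case(source, replacement):
    """Give replacement the case pattern of source: all caps, capitalized,
    or lowercase. Mixed patterns fall back to capitalized or lowercase."""
    if source.isupper():
        return replacement.upper()
    if source[:1].isupper():
        return replacement.capitalize()
    return replacement.lower()


def replace_keep_case(text, wrong, right):
    """Replace wrong with right regardless of case, keeping the case pattern."""
    pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: match_case(m.group(0), right), text)


text = "Adminstration, ADMINSTRATION and adminstration."
fixed = replace_keep_case(text, "adminstration", "administration")
```

The two tokenizers disagree on punctuation: splitting on spaces keeps "Adminstration," and "adminstration." with their trailing marks, while the regex strips them. Grouping by lowercase key would show the three case variants as one entry in the list, and the replacement restores each variant's own case pattern.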

Here is a very rough mock up based on the cluster and edit template.

strarsis · 2021-04-01T15:20:38Z

@grahamjevon: I found a quite primitive but efficient approach:
Export to Excel-compatible format, open in Excel (or Libre/OpenOffice Calc), run the spell-checker.
It also allows adding to dictionary and changing all occurrences. It works quite well.
After correction, the Excel file can be imported back into OpenRefine for further clustering.

strarsis added the "Type: Feature Request" and "Status: Pending Review" labels Mar 2, 2021

wetneb added the "clustering" label and removed the "Status: Pending Review" label Mar 3, 2021

strarsis closed this as completed Mar 7, 2021

tfmorris changed the title ~~Dictionary/spell check?~~ Spell check feature Mar 9, 2021

tfmorris reopened this Mar 9, 2021

wetneb mentioned this issue Oct 16, 2022 in "Introduce extension point for cell rendering. #5312" (merged)
