Spell check feature #3688

Open
strarsis opened this issue Mar 2, 2021 · 20 comments
Labels
clustering (Issues related to the clustering operation, to merge similar values in a text column)
Type: Feature Request (Identifies requests for new features or enhancements. These involve proposing new improvements.)

Comments

@strarsis

strarsis commented Mar 2, 2021

Proposed solution

An integrated spell-checker (OpenSource/free if possible) for cells.

Alternatives considered

Exporting all the data and then using a different, external program.

Additional context

This is similar to Clustering but rather for full text fields.

@strarsis strarsis added the "Type: Feature Request" and "Status: Pending Review" labels Mar 2, 2021
@wetneb
Sponsor Member

wetneb commented Mar 3, 2021

Hi @strarsis, thanks for opening this issue!

I am wondering if you could expand a bit on the use case. OpenRefine is primarily designed for batch transformations, not manual editing of cells, so it is not clear what sorts of workflows you are thinking of here. If this is about adding spell-checking to the cell editing dialog, that could be done with the browser's native spell-checker, without us needing to ship anything special.

Also I would be interested to understand better how this would relate to clustering.

@strarsis
Author

strarsis commented Mar 3, 2021

@wetneb: So when there is a large list of data, e.g. products that all have lengthy description texts, automatic spell check would be great. This is also related to clustering: since words should be spelled consistently, a way to merge similar words into their proper spelling, much like clustering of facets, would be very useful.

@wetneb
Sponsor Member

wetneb commented Mar 3, 2021

> So when there is a large list of data, e.g. products that all have lengthy description texts, automatic spell check would be great

This is still very vague to me. Can you describe in more detail how you would use this feature?

@strarsis
Author

strarsis commented Mar 3, 2021

For example: The product description may contain domain-specific words, e.g. "Jacquard" (fabric/cloth type). And there are different variations of misspellings, like "Jaquard", "jaquard", "jacquard" or even "jquard". The similarity matching of the clustering feature would be very helpful here. It could find all these occurrences in the full text itself (using the words as facets, though this may not be enough, as words could also have spaces missing, e.g. "wordJaquard"), allowing fast and efficient correction of all these spelling mistakes.

@wetneb
Sponsor Member

wetneb commented Mar 3, 2021

Ok, I think I can roughly see your request. I would rephrase it like this: implement a variant of the clustering feature which works at the word (or phrase) level instead of on the whole cell value. If we go for word-level, that would roughly amount to splitting cell values by whitespace, clustering the resulting words, and then joining them back again. Would that workaround work for you?
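A minimal sketch of that split, cluster and join workaround, written in Python rather than as an OpenRefine operation. The similarity measure (`difflib` from the standard library), the `cutoff` value and the function names `cluster_words` and `apply_clusters` are assumptions made for illustration; this is not how OpenRefine's clustering is implemented.

```python
import difflib
from collections import Counter


def cluster_words(cells, cutoff=0.8):
    """Map every word to the most frequent similar spelling (greedy, illustrative)."""
    counts = Counter(word for cell in cells for word in cell.split())
    representatives = {}  # lowercase form -> chosen spelling
    mapping = {}          # every seen word -> chosen spelling
    # Visit frequent words first so they become the cluster representatives.
    for word, _ in counts.most_common():
        close = difflib.get_close_matches(
            word.lower(), list(representatives), n=1, cutoff=cutoff
        )
        if close:
            mapping[word] = representatives[close[0]]
        else:
            representatives[word.lower()] = word
            mapping[word] = word
    return mapping


def apply_clusters(cells, mapping):
    """Split each cell on whitespace, substitute words, and join them back."""
    return [" ".join(mapping.get(w, w) for w in cell.split()) for cell in cells]


cells = [
    "Jacquard fabric",
    "Jacquard weave",
    "Jaquard fabric",
    "jaquard weave",
    "jquard fabric",
]
mapping = cluster_words(cells)
print(apply_clusters(cells, mapping))
```

Splitting on whitespace leaves punctuation attached to the neighbouring word and collapses runs of whitespace when joining, so this only approximates the behaviour one would want on real prose.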

@wetneb wetneb added the "clustering" label and removed the "Status: Pending Review" label Mar 3, 2021
@strarsis
Author

strarsis commented Mar 3, 2021

@wetneb: Hm, this could be a makeshift solution. Though grammar, punctuation (commas, periods) and sentence structure wouldn't be taken into account.

@ostephens
Sponsor Member

For me "spellcheck" and "word clustering" seem distinct tasks.

Spellcheck suggests checking words against some existing dictionary, identifying words that do not exist in the dictionary and then either correcting the word, ignoring the word or adding the word to the dictionary so future occurrences of the word are accepted as correct. Spell checking is obviously language dependent - you need dictionaries for different languages.

Word clustering seems to be about saying "for this body of text, are there words that are similar enough that the same original word might have been meant"?

Spellcheck seems like it might be do-able but I feel like it's on the edge of the intended functionality of OpenRefine. My first thought is that an Extension might be the best approach to adding this type of functionality.

Word clustering seems like it could be very hard. Generally, the shorter the strings you are trying to cluster, the more likely it is you'll see false positives in clustering, and as we increase the number of strings to be compared, the number of comparisons that need to be made for nearest neighbour clustering increases faster than the number of strings, which makes it a difficult problem from a technical perspective.
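To put rough numbers on that growth: assuming a naive approach that compares every distinct word with every other one (no blocking or other pruning), the count is n * (n - 1) / 2. The function name below is made up for this illustration.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n distinct strings: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (100, 1_000, 10_000):
    print(n, pairwise_comparisons(n))
# 100 4950
# 1000 499500
# 10000 49995000
```

Going from whole-cell values to individual words typically increases n, which is why the word-level case is the harder one under this assumption.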

Finally, based on the example given by @strarsis above

> The product description may contain domain-specific words, e.g. "Jacquard" (fabric/cloth type)

It seems that there could be an overlap with Named Entity Recognition rather than spelling - perhaps part of the question here is whether the aim is to identify the occurrence of an entity, or just correct some spelling? I'm still a bit unclear on whether the point here is about "correcting" text or if it's about finding the occurrence of particular words in bodies of text where we know that the words might be mis-spelt or otherwise garbled. "Correcting" spelling is a different problem to "finding words which might be X" I think.

For me, an example data set, together with an example of how you'd hope to manipulate it in OpenRefine and what the desired output would be, would be really helpful in understanding what's required here.

@tfmorris
Member

tfmorris commented Mar 5, 2021

I think an extension, similar to the Named Entity Recognition extension, is the right place for this. It's definitely natural language specific.

The functionality still needs more specification though. Some possible modes of operation for a GREL function:

  • return a list of words that are (potentially) misspelled in the input text
  • return the input text with all misspellings corrected (above a certain user specified confidence threshold?)
  • return a list of alternative suggestions for each misspelling in the input text (this would require some type of structured output)

Users would probably want control over a user dictionary in addition to the system supplied dictionary.

If this isn't a GREL function, but an interactive UI facility, it could be a lot more graphical and interactive with the user, but it would also be pretty far outside the current interaction model.
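The three modes listed above, together with a user dictionary, could look roughly like the following. This is a Python sketch, not GREL and not an existing OpenRefine or extension API; the tiny word sets, the function names and the `cutoff` confidence threshold are all placeholders, and `difflib` stands in for a real spell-checking engine.

```python
import difflib
import re

# Placeholder dictionaries; a real checker would load language-specific word lists.
SYSTEM_DICTIONARY = {"the", "fabric", "is", "a", "woven", "cloth"}
USER_DICTIONARY = {"jacquard"}
DICTIONARY = SYSTEM_DICTIONARY | USER_DICTIONARY

WORD_RE = re.compile(r"[A-Za-z]+")  # assumption: words are runs of ASCII letters


def misspelled(text, dictionary=DICTIONARY):
    """Mode 1: list the words not found in the dictionary (case-insensitive)."""
    return [w for w in WORD_RE.findall(text) if w.lower() not in dictionary]


def suggestions(text, dictionary=DICTIONARY, cutoff=0.8):
    """Mode 3: structured output mapping each unknown word to candidates."""
    candidates = sorted(dictionary)
    return {
        w: difflib.get_close_matches(w.lower(), candidates, n=3, cutoff=cutoff)
        for w in misspelled(text, dictionary)
    }


def corrected(text, dictionary=DICTIONARY, cutoff=0.8):
    """Mode 2: replace unknown words whose best suggestion clears the cutoff."""
    candidates = sorted(dictionary)

    def fix(match):
        word = match.group(0)
        if word.lower() in dictionary:
            return word
        close = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=cutoff)
        # Note: the suggestion is returned in dictionary (lowercase) form.
        return close[0] if close else word

    return WORD_RE.sub(fix, text)


text = "The jaquard fabric is a wovn cloth"
print(misspelled(text))
print(suggestions(text))
print(corrected(text))
```

Mode 3 returns a mapping rather than a string, which illustrates the point about needing structured output.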

@strarsis
Author

strarsis commented Mar 5, 2021

That feature would be very useful! If this is too tangential to the intended feature set of OpenRefine, are there tools that implement this kind of functionality?

@tfmorris
Member

tfmorris commented Mar 5, 2021

> That feature would be very useful!

"That feature" being what? We've discussed a number of different things and none of them are very well specified.

> If this is too tangential to the intended feature set of OpenRefine, are there tools that implement this kind of functionality?

There is a Named Entity Recognition extension for OpenRefine, which I'm pretty sure is listed on our website, although I don't know if it's currently being maintained. You can also do this using the "Fetch URL" operation to ping a RESTful API.

@strarsis
Author

strarsis commented Mar 6, 2021

@tfmorris: Apparently it isn't maintained anymore (archived). But it could still work. And now I also know what this feature is called.

@tfmorris
Member

tfmorris commented Mar 6, 2021

GitHub is a distributed ecosystem, so even if the original project is no longer being maintained, one can often find another fork which has activity. You can visualize the network of forks here: https://github.com/RubenVerborgh/Refine-NER-Extension/network
It looks like this fork received updates as recently as January: https://github.com/stkenny/Refine-NER-Extension

@strarsis
Author

strarsis commented Mar 6, 2021

@tfmorris: Thanks! Does the detection require a language-specific dictionary/model?

@tfmorris
Member

tfmorris commented Mar 6, 2021

Named Entity Recognition is both natural language specific and, to some extent, domain specific. That OpenRefine NER extension uses the Stanford CoreNLP software which has models for these languages: https://stanfordnlp.github.io/CoreNLP/human-languages.html

@strarsis
Author

strarsis commented Mar 6, 2021

@tfmorris: This is awesome! Thanks!

@wetneb
Sponsor Member

wetneb commented Mar 7, 2021

If that addresses the problem perhaps we could close this issue?

@strarsis
Author

strarsis commented Mar 7, 2021

@wetneb: Yes!

@strarsis strarsis closed this as completed Mar 7, 2021
@tfmorris tfmorris changed the title from "Dictionary/spell check?" to "Spell check feature" Mar 9, 2021
@tfmorris
Member

tfmorris commented Mar 9, 2021

We've had another request for spell check, so reopening to host the discussion.

@tfmorris tfmorris reopened this Mar 9, 2021
@grahamjevon

I've had a similar request on my mind.

Problem

The problem I have is that I have columns that contain long strings of text (e.g. verbose catalogue descriptions in the form of multiple sentences). This data is created in Excel (often by people for whom English is a second language) and often contains spelling errors because Excel does not have a spell check function like Word.

Solution

A spell check and edit function which opens a new window similar to the cluster and edit function. Instead of clusters of similar cells being presented to the user, a list of unknown words will be presented to the users (i.e. a list of all words in the selected column, which are not found in a dictionary).

The user will then have the option to tick the words they want to change and to enter the replacement term into an empty field. When the user clicks "Confirm", OpenRefine will essentially apply the replace function to all of the unknown words where the checkbox has been selected.
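As a rough illustration of what "Confirm" would do, here is a Python sketch that applies a set of ticked replacements to every cell in a column. The function name, the sample cells and the whole-word matching rule are assumptions for illustration; OpenRefine itself would do this through its own operations.

```python
import re


def apply_replacements(cells, replacements):
    """Replace whole-word occurrences of each ticked word in every cell.

    `replacements` maps an unknown word to the text the user typed for it.
    """
    if not replacements:
        return list(cells)
    # Longest alternatives first, so a longer word is tried before a shorter prefix.
    alternatives = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")
    return [pattern.sub(lambda m: replacements[m.group(0)], cell) for cell in cells]


cells = [
    "Adminstration records, 1950.",
    "Records of the Administraton",
    "Administration files",
]
ticked = {
    "Adminstration": "Administration",
    "Administraton": "Administration",
}
print(apply_replacements(cells, ticked))
```

Matching is case-sensitive here, so differently cased variants would need their own entries or a case-handling rule.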

Current alternatives

I could do something similar outside of OpenRefine using pyspellchecker. But it would be easier to have an extra cleaning function in OpenRefine while I'm cleaning the data in other ways. And this would be more accessible to people not familiar with Python.

Possible configuration features

  1. Just as the cluster function uses different clustering methods, perhaps the spell check function could allow the user to select different dictionaries. Not least because different users will use different languages.
  2. It would probably also be useful for different word boundary methods to be available for selection. e.g. \b\w*\b or value.split(' '). I'm sure there are others that would be even more useful. I imagine different users would want to define words in different ways.
  3. Perhaps it might be useful for users to be able to establish their own dictionary or list of stop words.
  4. It might be useful to include a filter within the spell check window (e.g. if the column contains English and Arabic descriptions, it might be helpful if the user could filter out any words that don't contain Latin characters).
  5. I'm not sure if I would want the spell checker to suggest a replacement word, but maybe some use cases would benefit from this.
  6. Case sensitivity is another consideration. To reduce the size of the list it might be useful to have words with the same spellings, but different cases, grouped as one word. But the replace feature would probably want to maintain the same case pattern. It might be a bit premature to think about this in detail, but it might be useful to have some degree of flexibility over case sensitivity for different use cases.
  7. It would probably be useful to apply this to more than one column at a time (to reduce duplication of effort if there is more than one column with text requiring spell checking).
  8. I wonder whether it would also be useful to have the option to cluster similar unknown words, but maybe that is making things too complex. An example of where this would be useful would be in a column where the same word has been misspelled in different ways (e.g. "Administration" is misspelled as "Adminstration" and "Administraton").
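To make a few of these options concrete, here is a hedged Python sketch of points 2, 4 and 6: two word-boundary methods, a Latin-only filter, and case-insensitive grouping with a case-preserving replacement. All function names are invented for this example. The regex uses `\w+` rather than `\w*`, because `\w*` also returns empty matches at word boundaries.

```python
import re
import unicodedata
from collections import defaultdict


def words_by_regex(text):
    """Word boundary method A: runs of word characters."""
    return re.findall(r"\b\w+\b", text)


def words_by_split(text):
    """Word boundary method B: split on single spaces (punctuation stays attached)."""
    return text.split(" ")


def has_only_latin_letters(word):
    """Filter for point 4: every character is a letter from the Latin script."""
    return bool(word) and all(
        ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN") for ch in word
    )


def group_case_variants(words):
    """Point 6: list differently cased spellings under one lowercase key."""
    groups = defaultdict(set)
    for word in words:
        groups[word.lower()].add(word)
    return {key: sorted(variants) for key, variants in groups.items()}


def match_case(original, replacement):
    """Give the replacement the same simple case pattern as the original word."""
    if original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement.lower()


text = "Adminstration, files. ADMINSTRATION files"
print(words_by_regex(text))
print(words_by_split(text))
print(group_case_variants(words_by_regex(text)))
print(match_case("ADMINSTRATION", "administration"))
```

The two tokenizers disagree on "Adminstration," and "files.", which shows why the choice of word boundary method affects which words end up in the unknown list.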

Here is a very rough mock up based on the cluster and edit template.
[Image: OpenRefineSpellCheckMockUp]

@strarsis
Author

strarsis commented Apr 1, 2021

@grahamjevon: I found a quite primitive but efficient approach:
Export to an Excel-compatible format, open the file in Excel (or LibreOffice/OpenOffice Calc), and run the spell-checker.
It also allows adding words to the dictionary and changing all occurrences. It works quite well.
After correction, the Excel file can be imported back into OpenRefine for further clustering.
