Spreadsheet question: Need to find dupes in large-ish list


Columns for each entry are: number of source data set, number within data set, first name(s), last name, DOB, date of death, age at death (formula).

Bit more than 7,000 entries so far. Using LibreOffice under Linux.

Any bright ideas on how to check for dupes using the spreadsheet?

Guess I could export the list into a text file and use some kind of dupe finder that works on those.


[–] • 3 pts

Is it a dupe if it matches on any column, or only if it matches on all of them?

I usually pass it through bash and do sort | uniq.

It also depends on whether you want to remove dupes or mark them. To count dupes, use sort | uniq -c | sort -n. The sort -n is optional, but it puts all of the duplicated lines together at the end of the output.
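For anyone who would rather not leave Python, here is a minimal sketch of the same counting idea as sort | uniq -c | sort -n. It assumes the sheet has been exported as one text line per entry; the function name and the sample lines are made up for illustration. Unlike uniq -c, it only reports lines that occur more than once.

```python
from collections import Counter


def count_duplicate_lines(lines):
    """Return (count, line) pairs for lines seen more than once, lowest count first."""
    counts = Counter(line.rstrip("\n") for line in lines)
    repeated = [(n, line) for line, n in counts.items() if n > 1]
    return sorted(repeated)


# Hypothetical sample standing in for the exported text file.
sample = [
    "1\t17\tAnn\tLee\t1900-01-02\t1970-03-04\n",
    "1\t18\tBob\tRay\t1911-05-06\t1980-07-08\n",
    "1\t17\tAnn\tLee\t1900-01-02\t1970-03-04\n",
]

for n, line in count_duplicate_lines(sample):
    print(n, line)
```

Note that this only catches entries whose whole line is identical, which is the "matches on all" case from the question above.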

link

[–] • 0 pt

I'll give that a try. Thank you.

parent
link

[–] [deleted] • 2 pts

In Excel I sort the data, then compare each row to the previous row, returning TRUE if they match. Then I delete all rows marked TRUE.
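The sort-then-compare idea above can also be sketched outside the spreadsheet. This is only an illustration, assuming each row has already been read into a tuple; the function name and the sample rows are hypothetical.

```python
def mark_adjacent_duplicates(rows):
    """Sort rows, then flag each row that equals the row directly before it."""
    ordered = sorted(rows)
    # The first row has no predecessor, so it is never flagged.
    flags = [False] + [cur == prev for prev, cur in zip(ordered, ordered[1:])]
    return list(zip(ordered, flags))


# Hypothetical rows: (last name, first name, DOB)
rows = [
    ("Lee", "Ann", "1900-01-02"),
    ("Ray", "Bob", "1911-05-06"),
    ("Lee", "Ann", "1900-01-02"),
]

marked = mark_adjacent_duplicates(rows)
# Keeping only unflagged rows is the equivalent of deleting the rows marked TRUE.
kept = [row for row, is_dupe in marked if not is_dupe]
print(kept)
```

Sorting first is what makes the comparison with the previous row sufficient: identical rows always end up next to each other.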

link

[–] • 0 pt

This is what I do. Sort alphabetically, then in the column to the right, second row (B2), add a formula, e.g. =IF(A1=A2,0,1). Double-click the fill handle to copy the formula all the way down.

All duplicates will have a 0 next to them, originals a 1. Now copy the new column and paste over as 'values'.

Sort by the new column and delete all the ones with a zero.

Alternatively, stop relying on spreadsheets and learn how to use a database.
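As a rough idea of what the database route could look like, here is a sketch using SQLite through Python's standard sqlite3 module. The table name, column names, and sample rows are assumptions based on the columns listed in the original post, not anything the poster specified. It treats two entries as duplicates when names and both dates match, ignoring which data set they came from.

```python
import sqlite3


def find_duplicate_people(rows):
    """Return (first_names, last_name, dob, dod, count) for people listed more than once."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema mirroring the spreadsheet columns (minus the age formula).
    conn.execute(
        "CREATE TABLE people (source_set INTEGER, entry_no INTEGER, "
        "first_names TEXT, last_name TEXT, dob TEXT, dod TEXT)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?, ?, ?)", rows)
    query = """
        SELECT first_names, last_name, dob, dod, COUNT(*) AS n
        FROM people
        GROUP BY first_names, last_name, dob, dod
        HAVING COUNT(*) > 1
        ORDER BY last_name, first_names
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical sample rows.
people = [
    (1, 17, "Ann", "Lee", "1900-01-02", "1970-03-04"),
    (2, 5, "Ann", "Lee", "1900-01-02", "1970-03-04"),
    (1, 18, "Bob", "Ray", "1911-05-06", "1980-07-08"),
]

print(find_duplicate_people(people))
```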

parent
link

[–] • 1 pt

Excel has an option called 'Conditional Formatting' up on the main menu. Click on it, and there is a drop-down called 'Highlight Cells Rules', and then you select 'Duplicate Values'. I don't know if LibreOffice has this functionality. As an accountant, I think LibreOffice is shit. I say that as a bit of an open source nerd, too.

here's a link to screenshot: https://files.catbox.moe/qm4afh.png

There are other ways mentioned here in this thread; I just thought I'd point this method out too.

link

[–] • 1 pt

I don't know how in Linux, but excel can remove duplicates.

You can also (gasp) paste the data into a google sheet and do it in your browser, if you don't mind google spying on your data of course.

link

[–] • 1 pt

Conditional formatting in Excel works OK, but I prefer to use Access: queries make it really simple. It's much better than Excel when you are comparing separate lists, imo, especially when they're not all in the same order or when one list has more entries than the other. I also use Access to compare two reports to weed out duplicate accounts; two additional queries then identify which accounts exist in one report but not the other.
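The three queries described above (accounts in both reports, accounts only in the first, accounts only in the second) map onto plain set operations. This is a sketch of the idea in Python rather than Access, and the account IDs are invented for the example.

```python
def compare_lists(report_a, report_b):
    """Split two lists of account IDs into shared and one-sided entries."""
    a, b = set(report_a), set(report_b)
    return {
        "in_both": sorted(a & b),    # duplicate accounts across the reports
        "only_in_a": sorted(a - b),  # present in the first report only
        "only_in_b": sorted(b - a),  # present in the second report only
    }


# Hypothetical account IDs; order and list length do not matter.
result = compare_lists(["A100", "A200", "A300"], ["A300", "A400", "A100"])
print(result)
```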

link

[–] • 1 pt

Pivot table in Excel, or the cool Linux equivalent, displaying "count" on the entries. Anything with a count of more than 1 is a duplicate.
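The count-per-entry idea can be sketched on a CSV export too. The example below is an assumption-heavy illustration: it supposes the columns are in the order given in the original post, with no header row, and it groups only on name and date columns, since the two data-set number columns would differ between copies of the same person. The function name, KEY_COLUMNS, and the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumed zero-based positions: first name(s), last name, DOB, date of death.
KEY_COLUMNS = (2, 3, 4, 5)


def group_by_person(csv_text, key_columns=KEY_COLUMNS):
    """Group rows on the key columns and keep only groups with more than one row."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= max(key_columns):
            continue  # skip blank or short rows
        # Ignore case and stray spaces so 'ann' and 'Ann ' land in the same group.
        key = tuple(row[i].strip().lower() for i in key_columns)
        groups[key].append(row)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Hypothetical export with one entry repeated across two data sets.
export = (
    "1,17,Ann,Lee,1900-01-02,1970-03-04\n"
    "2,5,ann,Lee ,1900-01-02,1970-03-04\n"
    "1,18,Bob,Ray,1911-05-06,1980-07-08\n"
)

for key, rows in group_by_person(export).items():
    print(len(rows), key, [row[:2] for row in rows])
```

Printing the first two columns of each row in a group shows which data sets the suspected duplicates came from, so they can be checked by hand before deleting anything.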

link

[–] • 0 pt

That should work. Thanks.

parent
link

[–] • 2 pts

It's very crude. I'm positive there are better ways to do it in Excel that would highlight and group dupes, but I'd have to search for how. You can also copy/paste the results of the pivot table and then sort that to hide all the 1's.

parent
link

(post is archived)