As a professor who often leads the data collection and analysis of scholarly research projects, and who collaborates across many universities, I have encountered good, bad, and downright terrible data being passed around in Excel files. This is usually the result of Ph.D. students still moving up the learning curve, but it can be the result of other issues. So I created this document to help students not be terrible at hand-collecting, recording, transferring, and (if we absolutely must) manipulating data in Excel. Further, research professors may find a tip that allows them to further optimize their already no-doubt finely honed data skills. This (much like the vaunted U.S. Constitution) will be a living document that evolves and grows with time. Constitution constructionist trolling aside, this simply means that I will be updating this document as I think of new tips or as students find new ways to anger me with the data sent to me for a project. Tips are numbered merely for easy reference, not to confer some order or importance upon them. Importance and/or recency of data anger will be conferred with exclamation points and/or italics. That said, the number 1 rule is pretty important.

Why should I care? First, there is the classic computer science mantra: garbage in, garbage out. Heaven forbid, but badly input data could, if not corrected along the way by a professor overseeing the work, bias a project’s results that will be submitted to a conference or journal. Second (and more likely with skilled researchers guiding the project), poorly collected data negatively impacts a research project through additional time wasted getting the data to merge properly, cleaning the messy data, and analyzing models with poorly thought-out variables. The latter usually results in additional data collection (if still possible) to replace the poorly thought-out variables with better ones. All this wasted time and effort also creates frustration. If the same person is doing all of these tasks, you are only hurting yourself by ignoring these tips. If a student is collecting data poorly for a professor, this can lead to… unpleasantries. Avoid future pain by following these rules.

Note: While using Excel is a reasonable choice for data that must be hand-collected, it should be an absolute last resort for otherwise interacting with the data. And no, already knowing how to do something in Excel rather than learning how to do it in SAS, Stata, or R is not a valid reason. Don’t be lazy.

  1. A column in Excel represents a single variable that someone will do statistics on later! Treat it as such!!! This means a) figuring out what the variable will be later in the model to determine how the data should be entered and b) only entering that type of data. For instance, faced with collecting whether an executive is a CEO: a) figure out that a binary variable is a wise choice for this and b) enter a 1 or 0 only!!! Rather than the terrible text strings that often become an incoherent combination of YES, y, no, YeS, 1, Y, N, NO, Co-CEO, Interm CEO, [blank], Interm CEO that was later approved to be full CEO, NA, N/A, I couldn’t find this one, Not applicable, President Only, etc.
  2. Use dummy variables whenever possible. Many, many things can be converted into one or more dummies if you consider it first. So instead of ‘executive position’ with a bunch of free-form text (similar to the example text in #1), create a dummy for each position: CEO, COO, CFO, etc. (see the first sketch after this list).
  3. Expand your dummy variables as needed rather than breaking the #1 rule! For instance, after setting up the dummies in #2, if one encounters a Co-CEO, instead of abandoning your principles and writing “Co-CEO” in the column for CEO, create a new Co-CEO column that can still be coded with a 1 or 0. Make a note (see #4) as to whether Co-CEOs are still coded as a CEO (i.e., is the Co-CEO variable additive to the CEO variable or replacing it).
  4. Put any notes in a separate column or columns. Depending on the amount and type of data being collected in Excel, a single ‘notes’ column may suffice, or it may be better to create a notes field for each variable that needs it. Name it something obvious, like executive_scandal_notes for notes on the executive_scandal variable.
  5. Enter dates numerically (no text months) and consistently!!! Generally, this should be either 2020/3/25 or 3/25/2020. Yes, including the placeholder zeros (e.g., 2020/03/25) is a little cleaner, but most programs can import correctly without them and people tend to struggle with including them. I often see a huge mix of date styles in a single column: 3/25/1999, March 25th, 1999, 3-25-1999, 3.25.1999, Mar 25, Mar 25 99, 25th Mar 1999, etc. Such a mix neither imports well nor is easy to clean with a script, because it is inconsistent (e.g., a script splitting on the /, or grabbing the final 4 characters for the year, will choke on half the entries). See the date sketch after this list.
  6. Never change an ID in Excel. These are often used to merge the Excel data with other data, so if you change the ID the merge will fail. Instead, create a new column such as ID_updated with the change you want to make. This allows for merging on the newly updated ID without throwing away the old ID through your reckless action (see the merge sketch after this list).
  7. Unless you explicitly know otherwise, assume every variable is an ID variable (or otherwise similarly important). Disk space is practically free while time is incredibly valuable, so instead of changing data in an existing variable (column), make a new variable (column) for your changes. For instance, to get the stock market reaction to an event, the date needed is often not the event_date but the first day the stock market was open following the event_date. The programs that retrieve stock market data also want the data formatted in a particular way. This, therefore, presents two temptations to change the event_date while retrieving this additional data – don’t do it. Make a new stock_reaction_date variable and leave the event_date unmolested so that one can easily merge the new data into the existing dataset (the same merge sketch after this list applies).
  8. Really, it is best to avoid changing any variable’s data if at all possible. In addition to breaking merges if the variable is an ID, the variable in question may undergo cleaning and/or other manipulation in code after import that your changes might break, thus also wasting others’ precious time. This is extremely likely when adding additional data (usually at the request of reviewers) later in a project. Again, disk space is practically free while time is incredibly valuable, so make another column and use another 0.0000001% of your disk space rather than potentially waste hours of time.
  9. Be careful of text IDs with leading zeros (e.g., 0023817). Excel likes to throw these zeros away. One must make the column a ‘text’ format (rather than General, Number, etc.) to keep it from doing this. Compustat’s unique firm ID, gvkey, is such a variable (see the import sketch after this list).
  10. Don’t hand-make a variable that can be created in a program. For instance, perhaps one’s eventual model will consider whether CEOs are fired the same day a scandal is announced by the firm. Collect the two dates (see #5) in two variables, scandal_announcement_date and CEO_fired_date, but do not create a human-entered dummy (it should be a dummy, see #2) CEO_fired_on_announce_day, as one can easily, and without mistakes, create this dummy in Stata/SAS/R by comparing the two dates (sketched after this list).
  11. Any time one wants to enter N/A in a variable, consider whether a) the way you are creating this variable is a good one, or b) you merely don’t understand what N/A actually means. A) Could you use a different coding and/or additional variables (see #3) to capture the data without a useless N/A that both breaks #1 and doesn’t capture whatever is driving you to put N/A there in the first place? B) If you are entering N/A into a mutually exclusive dummy, you are using it wrong. E.g., CEO_has_former_CEO_experience has to be a 1 or a 0 (or missing if one doesn’t know): the newly hired CEO either was a CEO before or was not, so N/A makes no sense (see the N/A sketch after this list for what a stray N/A does on import). And senseless data makes everyone unhappy.
  12. Watch out for spaces before and after data being entered. Macs particularly like to “help you” by adding a space after any and every word ever, because using the space bar is not elegant or something. Depending on the data, an extra space can break things on import or later in analysis, and such spaces are a pain to locate because the data looks right visually (a one-line fix is sketched after this list).
  13. If you are asked to fill in missing data in Excel, don’t change a bunch of other things without checking with your professor first. You may be changing IDs needed to merge your collected missing data back into the dataset or otherwise messing up the dataset (see #6, #7, and #8).
  14. Do not use multiple tabs in Excel. If you are using multiple tabs, you are likely doing some combination of the following: creating unnecessary duplicates, reusing variable names that can only be told apart by referencing the tab names, using variable names that are generally unclear without the tab names, and/or collecting data in a format different from the rest of the data that will require additional work to merge (that extra work is sketched after this list).
  15. The variable name (top Excel row) of the data should be a short but meaningful name, preferably in lowercase with underscores for spaces; unless the variable already exists somewhere else, in which case it should match that other naming exactly. For instance, firm_age is obvious in the data it contains, but even a more complicated variable such as firm_FEC_total_amt_cycle reasonably conveys that it is the firm’s total amount of FEC-tracked (Federal Election Commission) campaign contribution dollars in an election cycle (rather than in a year). Stata’s limit of 32 characters only allows so much description in the name, but it should be helpful without referencing the longer variable label (a quick check for this limit is sketched after this list).
  16. The top Excel row is for the variable names of the data below it and nothing else. This is not a place for notes to self (see #4) or even explanations of the data below. If you need a note to explain the data collected below, the variable is either ill-conceived (see #1, #2, and #3) or poorly named (see #15), or both.
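
Since a few of these rules are easier to show than to describe, the sketches below use R (any of Stata/SAS/R would do equally well). First, the dummy expansion from #2 and #3. This is a minimal sketch assuming a data frame execs with a hypothetical free-form position_text column that someone collected before reading rule #1:

    # Hypothetical messy text column, collected in violation of #1 and #2
    execs <- data.frame(position_text = c("CEO", "Co-CEO", "CFO", "ceo"))

    # One 0/1 dummy per position instead of one incoherent text column
    execs$is_CEO    <- as.integer(grepl("\\bCEO\\b", execs$position_text,
                                        ignore.case = TRUE))
    execs$is_co_CEO <- as.integer(grepl("co-ceo",    execs$position_text,
                                        ignore.case = TRUE))
    execs$is_CFO    <- as.integer(grepl("\\bCFO\\b", execs$position_text,
                                        ignore.case = TRUE))

    # Note that "Co-CEO" also matches the CEO pattern (the hyphen acts
    # as a word boundary), i.e., additive coding; record that choice per #4.

Had the dummies been collected as 1s and 0s from the start, none of this rescue code would be necessary.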
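For #5, a sketch assuming a hand-collected file events.csv with a consistently entered event_date column (both names are made up):

    library(readr)  # provides read_csv()

    events <- read_csv("events.csv")  # hypothetical hand-collected file

    # One consistent style (here YYYY/M/D) means one line of parsing;
    # placeholder zeros are not required for this to work.
    events$event_date <- as.Date(events$event_date, format = "%Y/%m/%d")

    # A column mixing "3/25/1999", "March 25th, 1999", "3-25-1999", etc.
    # has no single format string: anything the chosen format does not
    # match silently becomes NA.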
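For #6 and #7, why the untouched original ID matters; hand_data, other_data, and the column names here are assumptions:

    # The merge runs on the original, never-edited id column.
    merged <- merge(hand_data, other_data, by = "id")

    # Where a correction was recorded in id_updated (per #6), prefer it
    # afterwards; nothing was thrown away in the process.
    merged$id_final <- ifelse(is.na(merged$id_updated),
                              merged$id, merged$id_updated)

The same pattern handles #7: compute a new stock_reaction_date column in code and leave event_date alone.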
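For #9, a sketch using the readxl package; the file name and column order are assumptions. Telling the import which columns are identifiers keeps them as text end to end:

    library(readxl)

    # col_types must name a type for every column; gvkey is assumed to
    # be first, followed by one numeric and one date column.
    collected <- read_excel("collected.xlsx",
                            col_types = c("text", "numeric", "date"))

    # If the zeros were already lost but the ID width is known,
    # re-padding can recover them:
    sprintf("%07d", 23817)  # "0023817"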
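The derived dummy from #10, assuming both date columns were collected per #5 and parsed as dates on import:

    # One line, zero hand-entry mistakes; rows with a missing date
    # correctly come out NA rather than a guessed 0.
    events$CEO_fired_on_announce_day <-
      as.integer(events$CEO_fired_date == events$scandal_announcement_date)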
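And what a stray N/A actually does to a dummy (#11):

    # One "N/A" forces the whole column to import as text, so the
    # 'dummy' silently stops being usable as a number.
    x <- c("1", "0", "N/A", "1")
    as.integer(x)                      # 1 0 NA 1, plus a coercion warning
    mean(as.integer(x))                # NA: one stray value poisons the stat
    mean(as.integer(x), na.rm = TRUE)  # works, but only after cleanup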
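The invisible-space problem from #12 has a one-line fix on import, though not typing the spaces in the first place is still better (collected is an assumed data frame of hand-entered data):

    # Strip leading/trailing spaces from every character column.
    chr <- vapply(collected, is.character, logical(1))
    collected[chr] <- lapply(collected[chr], trimws)

    "CEO " == "CEO"  # FALSE: exactly the kind of silent bug this prevents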
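For #14, the extra work multiple tabs force on whoever imports the file (file and sheet handling via readxl; names assumed):

    library(readxl)

    path   <- "collected.xlsx"  # hypothetical multi-tab file
    pieces <- lapply(excel_sheets(path), read_excel, path = path)

    # Stacking only works if every tab used identical, well-named
    # columns; with a single tab none of this would be needed.
    combined <- do.call(rbind, pieces)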
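Finally, a quick check that variable names respect Stata’s 32-character limit (#15) before the file gets passed around (collected again assumed):

    # Flag any variable name Stata would truncate or reject.
    too_long <- names(collected)[nchar(names(collected)) > 32]
    if (length(too_long) > 0)
      warning("Rename before export: ", paste(too_long, collapse = ", "))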