Into (and Out of) the Weeds: Lessons Learned from my Newest Publication

Woohoo!… finally my newest publication is available via Early View in Evolutionary Applications


This study was a product of my Delta Science Postdoctoral Fellowship to investigate the mechanisms for effective biological control of the invasive water hyacinth in the Sacramento-San Joaquin River Delta (hereafter “Delta”).

In a nutshell: Two weevils (insects) are currently used all over the world for the biological control of the invasive water hyacinth, including in the Sacramento-San Joaquin River Delta, California. They have had variable success, with notable reduction of biomass and cover of this invasive aquatic weed in warmer climates compared to more temperate climates such as the Delta. Although temperature plays a large role in their success, I also investigated the role of genetic variation in the success of these weevils and whether there is lower genetic diversity and heterozygosity in the Delta compared to the native origin of these weevils (South America). To do this, I used polymorphic microsatellite markers (repeating regions of DNA in the genetic blueprints of a species) to detect differences between individuals and between populations. Additionally, as others and I noticed weevils from the field that appeared to be hybrids of these two species, I examined whether these hybrid-like weevils are genetic hybrids (meaning that they have genetic patterns representative of the genetic blueprints from both species).

In my opinion, the most important findings from this study were:

  1. We found hybrids! This is huge! These two weevils are introduced all over the world for the control of invasive water hyacinth. So now that we know hybridization occurs, it is critical to understand how hybridization affects their success. For instance, sometimes hybrids can outperform non-hybrids (hybrid vigor) whereas other times hybridization can decrease performance, as well as population growth (hybrid breakdown). I am very excited, however, that Dr. Julie Coetzee’s laboratory in South Africa is now starting to look into the effects of hybridization between these two weevil species.. so stay tuned (I know I will!).
    Demonstration of hybridization between the two weevils: Neochetina bruchi and N. eichhorniae
    Typical elytra markings characteristic of (a) Neochetina bruchi and (b) N. eichhorniae; compared to atypical elytra markings for (c) N. bruchi and (d) N. eichhorniae. Microsatellite markers confirmed that specimens (c & d) are hybrids. A weevil (c) from the study site in California resulted in 100% amplification of markers for N. bruchi and 80% amplification of the markers for N. eichhorniae, whereas a weevil from Texas (d) resulted in amplification of 25% of the markers for N. bruchi and 100% of the markers for N. eichhorniae.
  2. We found that low genetic variation from demographic bottlenecks (small populations of the weevils being introduced over and over again through the biological control programs), can sometimes be buffered by genetic admixture from multiple introductions. This was one of several findings from this study that was made possible through the unique combination of documented historical records from biological control programs and population genetic analyses, such as those we made with the program, FLOCK.

    Importation history and the Introduction Processes of Two Biological Control Agents of the Invasive Water Hyacinth
    Partial importation history (a, b) compared to the introduction processes predicted by FLOCK genetic analyses (c, d) of Neochetina bruchi and Neochetina eichhorniae, two weevils native to South America. Arrows depict the direction of the biological control releases and the date initially released, but do not point to the exact release site in that locality. Black lines and yellow‐filled regions represent the routes of importation history that were tested with microsatellite markers. Abbreviations are detailed in Table 1 (Hopper et al. 2019, Evolutionary Applications). Numbers next to abbreviations indicate the number of genetic sub‐clusters found from FLOCK analyses (c, d).
  3. Through combining this genetic study with a temperature performance study, we found that low genetic variation does not always hinder population adaptation or performance. This finding has been observed in other study systems, such as with the invasive Argentine ant, which has lower genetic variation in the introduced region, but is more successful than in the native range due to reduced intraspecific aggression among separate ant nests in the introduced populations. 

I also think that the lessons I learned from the process of writing this manuscript were very important, and I detail these below. 

Lesson 1: Know when to ask for help

This study grew out of work that I did at UC Davis, advised by Dr. Ted Grosholz, and in collaboration with Dr. Paul Pratt and Dr. Kent McCue (USDA/ARS), Dr. Ruth Hufbauer (Colorado State University) and Dr. Pierre Duchesne (Université Laval, Quebec, Canada). I actually contacted the latter two coauthors out of the blue during the analysis and writing portion of the study, since I felt like I needed more guidance from experts in the population genetics and data analysis field. I think knowing when to ask for help is really critical in science (no matter what your academic standing is), and it almost always improves the study to get additional opinions and critique. Think of it as a preliminary peer review before the ultimate peer review!

I also asked several folks who are experts in population genetics for advice on the collection, processing and analysis of the data before and during the start of this project, including: Dr. Jeremy Andersen (UC Berkeley), Dr. Rick Grosberg and Brenda Cameron (UC Davis), and Dr. Neil Tsutsui (UC Berkeley).

Lesson 2: Be Flexible, and Adapt to let the Data tell the Story

The title of this manuscript felt very suitable to me as ecological data are not always clear-cut, and sometimes it can take some time to wade through the weeds of data and figure out how to tell the accompanying story. This is especially true when the resulting data don’t match up with your original expectations and the initial story you thought you would tell. The key here is: don’t try to force your old story on the data… get a second opinion if needed, and be open-minded by letting the data ‘speak’ for themselves.

Lesson 3: Work Hard, Be Patient and Persistent

I think with anything that you do, sometimes a final product comes easy… and other times it seems like a long drawn out process. This project fell in the latter category, as it was my first time learning about and implementing a population genetics study, and I was working on the analysis and write-up of this study all while starting a new postdoc in an entirely new study system. I think an important aspect to finishing this project was really persistence. I spent week nights and weekends working diligently on the data analysis and writing and re-writing the paper. I also had to be patient with myself as I had to give myself time to learn the new types of analyses (which means new R packages and code!) and time to read all of the important papers in the study field.

If by chance you are also just starting a population genetic study, and feel a bit lost, please see my three-part tutorial blog posts which hopefully will provide some assistance:

  1. How-to use microsatellites for population genetics, Part I: Study Design, DNA extraction, Microsatellite Marker Design/Outsourcing
  2. Population Genetics Part II: Tips and Tricks, Multiplex PCR and Workflow of Microsatellites- the cheap way
  3. Population Genetics, Part III: Data Wrangling and Analyses

Lesson 4: Implement Self-Deadlines and Advertise them to your CoAuthors

Sometimes it’s hard to finish something if you don’t have a deadline. So make yourself a deadline, and tell everyone about this deadline, so that you are held accountable for this timeline. I actually had some coauthors that needed me to submit this article to the journal by October 1st in order to meet some of their workplace requirements for publications. Needless to say, I pulled an all-nighter and got it in to the journal by 5am that day.. true story….

Nothing like a little pressure to light up that writing-fire…

Lesson 5: Don’t cut corners

This goes with Lesson 3, on being patient. Towards the end of writing up a big study, you might find yourself just wanting it to be over. You would do anything to not have to think about that project or the data anymore. However, crossing that finish line is actually one of the most crucial components and can make or break your ability to get into a decent journal. Having co-authors often really helps solve this problem, as they will call you out on any cut corners (if they are doing their job), and will suggest critical improvements to the paper that maybe you were thinking about.. but were just initially too lazy to do. Also on this note.. Read the proof-version (final version before being published) of the paper word for word! You don’t want any typos in your finished product.. especially true in your Title, Abstract and Figure Legends!

Lesson 6: Celebrate at Each Stage of Completion

Be sure to acknowledge your accomplishments after you submit the manuscript the first time, after the revisions and acceptance, and after the manuscript goes In Press. After all- you worked hard to get to each of those stages, and celebration will help motivate you for the next time you have to do it all over again!


 

Population Genetics, Part III: Data Wrangling and Analyses

So some good news!- My population genetics study on the two herbivorous biological control agents of water hyacinth, Neochetina bruchi and N. eichhorniae, was finally accepted for publication w/ minor revisions in Evolutionary Applications. I will certainly post it once it is In Press! This was one of the projects I did for my Delta Science Postdoctoral Fellowship research.

So with that, I will fulfill my promise of posting Part III of my ‘how-to’ series for population genetics using microsatellites. To recap, Part I of this series explained what microsatellites are, and how to develop microsatellite markers, and Part II was on how to amplify and genotype these markers (the cheap way with universal fluorescently labeled tails, and multiplex PCR).

Part III (right here!) is my attempt to guide you through the jungle of population genetic analyses. I will discuss the main programs and analyses I used and how to properly format your data to make these packages and programs work!

STRUCTURE analysis of N. bruchi across eight populations and eight loci

I am not going to go into nitty-gritty detail because the tutorial for the ‘poppr’ package in R does a FANTASTIC job of guiding newbies (including my former self) through the process of how to import data into R, how to explore the data, and then how to conduct some basic and advanced analyses. The link is here  http://grunwaldlab.github.io/Population_Genetics_in_R/index.html

Honestly- this is how I started learning how to conduct population genetic analyses in R.. I kid you not. I literally followed the above tutorial step by step and did almost all of the analyses just to get a feel for the data and how to run population genetic stats.

So- Where to start you ask?


Well, one of my collaborators/coauthors (Dr. Ruth Hufbauer-CSU) emphasized that before you analyze the data, a good first step is to know what your questions are, and why you are asking them. Then you should base your analyses on those questions.

Here are some example questions:

  • Where did these samples/individuals originate from?
  • How many populations are there?
  • What is the genetic diversity in these populations, and are some populations more diverse than others? (Genetic diversity is often based on one or more of the following: heterozygosity, allelic richness and diversity indices such as the Shannon, Simpson, or Nei)
  • Are there population genetic bottlenecks?
  • Is there inbreeding?
  • Are there hybrids (crosses between two species)?

Then of course you have to report some general marker- and population-based stats (Deviation from HWE- Hardy Weinberg Equilibrium, Linkage Disequilibrium (LD), overall expected and observed heterozygosity (He and Ho), null alleles.. etc).
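To make the HWE part a bit more concrete, here is a minimal base R sketch of a chi-square test of Hardy-Weinberg proportions for a single diploid locus. The function name hwe_chisq and the toy allele calls are made up for illustration. The chi-square approximation behaves poorly with rare alleles, so treat this as a way to see what is being tested rather than as a replacement for the exact tests in the packages listed further down.

```r
# Chi-square test of Hardy-Weinberg proportions for one diploid locus.
# a1 and a2 hold the two allele calls of each individual (hypothetical data).
hwe_chisq <- function(a1, a2) {
  alleles <- sort(unique(c(a1, a2)))
  n <- length(a1)
  # Allele frequencies from the pooled allele calls
  p <- as.numeric(table(factor(c(a1, a2), levels = alleles))) / (2 * n)
  # Observed genotype counts, ignoring allele order within a genotype
  lo <- pmin(a1, a2)
  hi <- pmax(a1, a2)
  observed <- table(factor(lo, levels = alleles), factor(hi, levels = alleles))
  # Expected counts: n * p_i^2 for homozygotes, n * 2 * p_i * p_j for heterozygotes
  expected <- n * 2 * outer(p, p)
  diag(expected) <- n * p^2
  keep <- upper.tri(expected, diag = TRUE)
  chisq <- sum((observed[keep] - expected[keep])^2 / expected[keep])
  k <- length(alleles)
  df <- k * (k - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Toy example: eight individuals genotyped at one locus with three alleles
a1 <- c(100, 100, 102, 102, 104, 100, 102, 100)
a2 <- c(100, 102, 102, 104, 104, 104, 102, 102)
res <- hwe_chisq(a1, a2)
print(res)
```

In a real study you would run this per locus and per population, and correct for multiple tests.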


Load the Data: Before you do anything, you have to load the data in a format that the programs recognize!

  •  GenAlex- Excel Based Program-useful to check data formatting, and reformat data for import into R or other programs. However the main thing I found useful was understanding just what your dataframe should look like, which the Poppr tutorial emphasizes nicely: here
  • Adegenet package in R- (Jombart et al., 2010) Converts any type of data frame or matrix or txt file to a format that you need for a specific type of analysis
    • For most of my data analyses, I used the following two formats, converting my csv to data that the packages could recognize, or that I could convert further (read.genalex comes from the poppr package, and genind2gtypes comes from the strataG package):
      1. newdataname <- read.genalex("datafile.csv", genclone = FALSE)
        • you can convert this to a genepop format with the following code-
      2. newdataname2 <- read.genalex("datafile.csv")
        • # need genclone for gtypes conversion, hence don’t use genclone = FALSE
          • gtypesdata <- genind2gtypes(newdataname2)
Example dataframe for import and analysis with the Poppr R package. Areas selected in blue represent the Loci, Samples and Populations, see poppr tutorial for further examples
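Since the screenshot is hard to copy from, here is a rough base R sketch that writes a tiny diploid data set in the GenAlEx csv layout as I understand it: row 1 holds the number of loci, samples and populations followed by the size of each population, row 2 holds a title and the population names, and row 3 holds the column headers, with each locus spanning two columns. The function name write_genalex_sketch, the sample IDs and the allele sizes are all made up, and I would compare the output against the example files in the poppr tutorial before trusting it.

```r
# Write a small diploid data set in the GenAlEx csv layout (hypothetical data).
# geno: data frame with columns Ind, Pop, then two columns per locus named
# like L1.1 and L1.2. Individuals should already be grouped by population.
write_genalex_sketch <- function(geno, file, title = "example") {
  loci <- unique(sub("\\.[12]$", "", names(geno)[-(1:2)]))
  pop_sizes <- table(factor(geno$Pop, levels = unique(geno$Pop)))
  row1 <- c(length(loci), nrow(geno), length(pop_sizes), as.integer(pop_sizes))
  row2 <- c(title, "", "", names(pop_sizes))
  # Each locus name is followed by an empty header cell for its second allele
  row3 <- c("Ind", "Pop", as.vector(rbind(loci, "")))
  body <- do.call(paste, c(unname(as.list(geno)), sep = ","))
  writeLines(c(paste(row1, collapse = ","),
               paste(row2, collapse = ","),
               paste(row3, collapse = ","),
               body), file)
  invisible(file)
}

geno <- data.frame(
  Ind  = c("w1", "w2", "w3", "w4"),
  Pop  = c("CA", "CA", "TX", "TX"),
  L1.1 = c(100, 100, 102, 104), L1.2 = c(100, 102, 104, 104),
  L2.1 = c(200, 202, 200, 204), L2.2 = c(202, 202, 204, 204),
  stringsAsFactors = FALSE
)
out_file <- tempfile(fileext = ".csv")
write_genalex_sketch(geno, out_file)
cat(readLines(out_file), sep = "\n")
```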

Basic and Advanced Stats- I suggest to use:

  • Poppr– (Kamvar et al., 2015; Kamvar et al., 2014) this package depends on loading a lot of other packages and guides you through analyses in the tutorial. One example: as a wrapper for the ‘vegan’ package, poppr calculates a genotype accumulation curve (to see if you sampled enough loci and individuals).
  • Pegas-(Paradis, 2010) -calculate Linkage Disequilibrium (LD) and HWE across populations for each locus
  • PopGenReport-(Adamack et al., 2014)- calculate null-allele frequencies, pairwise FST and Jost’s D analyses, compare total and average allelic richness (accounting for sample size) and the number of private alleles among populations
  • diveRsity– (Keenan et al., 2013)-Estimate the average observed (Ho) and expected (He) heterozygosity, deviations from HWE (exact test) and the average ‘inbreeding coefficient’ (FIS) for each population across all loci.  In my paper I distinguish FIS as a measure of increases in homozygosity due to genetic drift caused by a larger population being separated into sub populations, rather than due to consanguineous mating (Crow, 2010)
  • inbreedR- (Stoffel et al., 2016)-calculate g2 as a measure of inbreeding.
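If you want to see what the packages above are computing under the hood, here is a minimal base R sketch of observed heterozygosity (Ho), expected heterozygosity (He) and FIS for one locus in one population. The function name het_stats and the toy genotypes are made up. The sketch assumes the small-sample correction of Nei (1978) for He, which may not be the exact estimator each package uses, so expect small differences.

```r
# Ho, He and FIS for one diploid locus in one population (hypothetical data).
het_stats <- function(a1, a2) {
  n <- length(a1)
  p <- table(c(a1, a2)) / (2 * n)            # allele frequencies
  ho <- mean(a1 != a2)                       # proportion of heterozygotes
  # Expected heterozygosity with the 2n / (2n - 1) small-sample correction
  he <- (2 * n / (2 * n - 1)) * (1 - sum(p^2))
  c(Ho = ho, He = he, FIS = 1 - ho / he)
}

a1 <- c(100, 100, 100, 102)
a2 <- c(100, 102, 102, 102)
res <- het_stats(a1, a2)
print(round(res, 3))
```

A positive FIS means fewer heterozygotes than expected, and a negative FIS means more. Averaging across loci per population gives the kind of summary the bullets above describe.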

Hypothesis testing: 

  • Linear Mixed Models (LMMs), or Generalized Linear Mixed Models (GLMMs), depending on which is more suitable for your data- with the lmer function (or glmer for GLMMs) in the lme4 package (Bates et al., 2015): I used this to test for the effects of population (collection site) on genetic diversity. Implementing an LMM accounts for the variability of the microsatellite loci by modeling locus as a random effect, and collection site as a fixed effect with allelic richness or expected heterozygosity as the response variables in separate models. Stepwise model simplification (Crawley, 2013) can be performed using likelihood ratio tests. Differences across collection sites can be compared, based on 95% CI, using Tukey’s post-hoc test in the ‘multcomp’ package (Hothorn et al., 2008). Read more about mixed models here.
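Here is a small sketch of that kind of model with lme4, using simulated data rather than my weevil data. The site names, effect sizes and noise levels are invented purely so the example runs. It fits locus as a random intercept and site as a fixed effect, then compares the model against one without site using a likelihood ratio test.

```r
library(lme4)

set.seed(1)
# Hypothetical data: expected heterozygosity (He) for 8 loci at each of 4 sites
d <- expand.grid(locus = paste0("L", 1:8), site = c("A", "B", "C", "D"))
locus_effect <- rnorm(8, sd = 0.10)
site_effect <- c(A = 0, B = 0, C = -0.15, D = -0.20)
d$He <- 0.6 + locus_effect[as.integer(d$locus)] +
  unname(site_effect[as.character(d$site)]) +
  rnorm(nrow(d), sd = 0.03)

# Locus as a random intercept, collection site as a fixed effect.
# Models are fit with ML (REML = FALSE) so that the likelihood ratio
# test comparing fixed-effect structures is appropriate.
full <- lmer(He ~ site + (1 | locus), data = d, REML = FALSE)
null <- lmer(He ~ 1 + (1 | locus), data = d, REML = FALSE)
lrt <- anova(null, full)
print(lrt)
```

For the pairwise site comparisons, multcomp's glht(full, linfct = mcp(site = "Tukey")) is the usual route. I left it out of the sketch to keep the dependencies down.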

Analyses of Population Structure

I suggest using several programs to see how they compare. I used:

  • STRUCTURE -as it is one of the most popular programs-(Pritchard et al., 2000). I used Clumpak (Kopelman, Mayzel, Jakobsson, Rosenberg, & Mayrose, 2015) to analyze the Best K, and to visualize and produce plots based on all of the runs from STRUCTURE outputs. Please see data-wrangling section below for more details on how to get your data into STRUCTURE, and also into Clumpak.
  • FLOCK- great Excel-based program (Duchesne & Turgeon, 2012), to see which populations are genetic sources for other populations, as well as determining ‘K’ the number of genetic clusters within a given population or site (useful to compare to output ‘K’s from STRUCTURE)
  • ‘adegenet’– to conduct Discriminant Analysis of Principal Components (DAPC) (Jombart, Devillard, & Balloux, 2010). There is a great tutorial here:
DAPC analysis on microsatellite data (eight loci) from eight populations of N. bruchi
Used the Adegenet package in R, and the Adegenet DAPC tutorial

Of course life is never easy.. especially when you have a MacOSX and for some reason the world revolves around PCs.

Here are some Data-wrangling tips for getting data into STRUCTURE and ClumpaK 

  • To get my data into the STRUCTURE format, I used the function ‘genind2structure’ that I found online here. Then in R, I used: genind2structure(inputdata, file="outputdata.txt", pops=TRUE).
  • Following this, you will need to:
    • DELETE THE GENALEX HEADERS
    • GET RID OF ANY ‘_’ IN THE TEXT FILE
    • GET RID OF LETTERS IN POP FILE, REPLACE WITH #S
    • DELETE IND AND POP HEADER
    • SAVE AS TXT FILE (TAB-DELIMITED)
    • RUN PERL SCRIPT Below..
    • since I have a MacOSX, I had to convert from DOS to UNIX with a terminal program before loading in STRUCTURE by using similar code to this: while($_ = <>){s/\r\n|\n|\r/\n/g; print $_;}
      and you can find more info here .
    • DON’T TOUCH FILE AFTER THIS.. TA DA!
    • To get my files into the Clumpak web processor, I had to use a different zip-program (Zipfiles4PC) than what the MacOSx does, as for some reason Clumpak couldn’t process- Mac-zipped files.
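Two of the steps above (replacing population letters with numbers, and the DOS to UNIX conversion) can also be done without leaving R. This is a minimal base R sketch, not the script I actually used. The function names recode_pops and to_unix_newlines are made up, and the file contents are a toy example.

```r
# Replace population labels with integers, in order of first appearance
recode_pops <- function(pop) {
  as.integer(factor(pop, levels = unique(pop)))
}

# Convert DOS (\r\n) or classic Mac (\r) line endings to UNIX (\n).
# The file is read and written as raw bytes so R does not translate
# line endings on its own.
to_unix_newlines <- function(infile, outfile) {
  raw_in <- readBin(infile, what = "raw", n = file.info(infile)$size)
  txt <- rawToChar(raw_in)
  txt <- gsub("\r\n|\r", "\n", txt)
  writeBin(charToRaw(txt), outfile)
  invisible(outfile)
}

# Toy example: a three-line file with mixed line endings
in_file <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".txt")
writeBin(charToRaw("w1\t1\t100\t102\r\nw2\t1\t100\t100\rw3\t2\t104\t104\n"), in_file)
to_unix_newlines(in_file, out_file)

pops <- recode_pops(c("CA", "CA", "TX", "AR"))
print(pops)
```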

Ok.. I think that is enough for now.. but really.. If I can emphasize one thing it is to go through the whole Poppr tutorial to get a handle of how to analyze data in R, and a feel for YOUR data!

 

 

How-to use microsatellites for population genetics, Part I: Study Design, DNA extraction, Microsatellite Marker Design/Outsourcing

So… you want to use microsatellite markers to assess the genetic variation and population structure of your focal study organism? Well if you are anything like me two years ago.. then you have no idea where to start. Otherwise- congratulations if you are already an expert- in which case you probably don’t need to read on 🙂

“See No Weevil, Hear No Weevil, Speak No Weevil” Illustration by Jacki Whisenant, contracted by Julie Hopper. Copyright 2017.

Two years ago, I was just like you (and these weevils above), and felt a bit overwhelmed and lost in undertaking the large task of designing microsatellite markers and genotyping these markers for the two weevil species (Neochetina bruchi and N. eichhorniae) that I have discussed in previous posts.

Very briefly to recap on my work:  these two weevil species are used all over the world for the biological control of the invasive water hyacinth, including the Sacramento-San Joaquin River Delta, California. They have had variable success, with notable reduction of biomass and cover of water hyacinth in warmer climates compared to more temperate climates such as the Delta. Although temperature plays a large role in their success, I am also investigating the role of genetic variation and particularly whether there is lower genetic diversity and heterozygosity in the Delta compared to the native origin of these weevils (Uruguay and Argentina).

In Part I- (this blog), I will detail the how-to’s of sampling design and strategy, and the development of (or outsourcing) microsatellite markers.

In Part II- (next blog) I will discuss how to make your final microsatellite marker selections, and the workflow of multiplex PCR and genotyping.

In Part III- (come back in a month!) I will detail how to analyze the data with various R-packages and other computer programs, and how to format the data files correctly for these programs.

On this note, please research your study system thoroughly, as every organism is different and may require different sampling strategies and methods than I detail here for two diploid beetle species (Insecta). Additionally.. my overview below on Part I- is very brief and I definitely skip small steps to be succinct. Also my suggestions are not the only way to do things and below this blog, I post links to several other great resources. Lastly- This work is currently in prep for publication and I will post an update again after publication.


Part I: 

 

Figure from: Grunwald, N.J., Everhart, S.E., Knaus, B.J., Kamvar, Z.N. 2017. Best Practices for Population Genetic Analyses. Phytopathology 107, 1000-1010.

Sampling Design and Strategy:

First before you start sampling or ordering primers- make sure that you have a solid study question with a testable hypothesis, and a good study framework.

Next: all of the power in your genetic analyses (aka, accuracy and ability to detect differentiation among populations, etc.) depends on: 1) your sample quality (aka DNA quality) and the number of samples (replicates) per treatment or location, 2) the number of high quality microsatellite markers (quality relating to two important characteristics: the markers are polymorphic, having 2 or more alleles per locus with more being better, and the markers lack true null alleles), 3) the robustness of your PCR, meaning whether the PCR conditions are truly suitable for your markers and whether they can result in reproducible data, 4) the assumptions of the data and 5) the choice of statistical tests and whether the tests are truly suitable for the data.

I will cover the latter (regarding statistical tests) in a future blog, but for today I would like to focus on the ideal # of samples and the # of polymorphic markers. There has been debate about how many samples and how many markers are necessary for robust studies, and if you study an endangered species -sometimes you just have to work with what you got!

In a perfect world– you will want to make up for what you lack in samples with microsatellite markers (loci) and vice versa. So if you have a lower number of replicates, then you will want a higher number of microsatellite markers (# of loci, and more important is to have polymorphic loci with 2 or more alleles/locus) to test for each individual (replicate), and again vice-versa. There are a couple of great papers that discuss sampling strategies and study design that you should definitely check out, particularly the one noted in the figure above (Grunwald et al. 2017), as well as Hale et al. 2012, which states that 25-30 individuals per population should be sufficient to accurately estimate allele frequencies for a given population (with some caveats). The caveats are that, obviously, 25-30 individuals per population would likely NOT be enough if you only have four microsatellite markers, particularly if these markers are not polymorphic or very variable (variability referring to the # of alleles per locus- the more the better!).. so keep this in mind. In general, with that many samples, 10-15 polymorphic markers should be fine (although the more the better), but again this depends on your study question and study system. Also, more samples might be necessary if you are interested in population differentiation (population genetic structure). In fact, in a landscape genetics study, Landguth et al. 2012 demonstrated that increasing the number of loci (and particularly having more variable loci) is more likely to increase the power of population genetic inferences compared to increasing the number of individuals.
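If you want a feel for why sample size matters, a quick simulation helps. The base R sketch below is my own toy illustration, not the analysis from Hale et al. 2012. It assumes a single locus with five alleles at made-up frequencies and random mating (alleles drawn independently), then asks how far off the estimated allele frequencies are, on average, for different numbers of sampled individuals.

```r
set.seed(42)

# Mean absolute error of estimated allele frequencies when sampling
# n diploid individuals from a population with known frequencies.
freq_error <- function(n, true_freq, reps = 500) {
  errs <- replicate(reps, {
    # 2n allele copies, drawn independently (assumes random mating)
    draws <- sample(seq_along(true_freq), size = 2 * n,
                    replace = TRUE, prob = true_freq)
    est <- tabulate(draws, nbins = length(true_freq)) / (2 * n)
    mean(abs(est - true_freq))
  })
  mean(errs)
}

true_freq <- c(0.40, 0.30, 0.15, 0.10, 0.05)  # hypothetical five-allele locus
sizes <- c(5, 10, 25, 50, 100)
errors <- sapply(sizes, freq_error, true_freq = true_freq)
print(data.frame(individuals = sizes, mean_abs_error = round(errors, 3)))
```

The error shrinks quickly at first and then flattens out, which is the general shape behind the "25 to 30 is usually enough" advice. Loci with more alleles, or with very rare alleles, will need more individuals.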

You can also test your samples with genotype accumulation curves to see if you have captured the majority of genetic variation (I used the poppr package in R for this and will discuss more on poppr and its primer in Part III of this blog series).
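The idea behind a genotype accumulation curve is simple enough to sketch in base R: randomly pick k loci, count how many distinct multilocus genotypes you can tell apart, and repeat for increasing k. The code below is a rough illustration on simulated genotypes, with made-up function names. poppr's genotype_curve function is the tool to use on real data.

```r
set.seed(7)

# Simulate one locus: genotype strings such as "100/102", smaller allele first
make_locus <- function(n, alleles) {
  a <- sample(alleles, n, replace = TRUE)
  b <- sample(alleles, n, replace = TRUE)
  paste(pmin(a, b), pmax(a, b), sep = "/")
}

# Number of distinct multilocus genotypes using only the chosen loci
count_mlg <- function(geno, loci) {
  nrow(unique(geno[, loci, drop = FALSE]))
}

# Mean number of distinct genotypes for 1, 2, ..., all loci
accumulation_curve <- function(geno, reps = 100) {
  n_loci <- ncol(geno)
  sapply(seq_len(n_loci), function(k) {
    mean(replicate(reps, count_mlg(geno, sample(n_loci, k))))
  })
}

# Hypothetical data: 30 individuals, 6 loci, 3 alleles per locus
geno <- as.data.frame(replicate(6, make_locus(30, c(100, 102, 104))))
curve <- accumulation_curve(geno)
print(round(curve, 1))
```

If the curve is still climbing steeply at the full number of loci, more markers would probably help. If it has flattened, the existing loci already separate most individuals.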

With that said.. If I had known 1 year ago what I know now…. I would have asked for folks around the world to collect more weevils for me, and I would have extracted more DNA! Just remember.. not all of your DNA extractions are going to end up working out.. due to human error and/or preservation issues. Thus it’s always good to add at least 10-20 more samples than you think you need!

Sampling locations of Neochetina bruchi and N. eichhorniae individuals that I used for the focal population genetics study (Hopper et al. In Prep). Thanks to all those who sent me weevils!

Designing or Outsourcing Microsatellite Marker Design: 

  • Marker Outsource Options: I want to first be upfront in that I actually ended up outsourcing this component of my study as I was going through a tough time and taking care of my dad who had metastatic cancer via at-home hospice care in Columbus, Ohio for two months. Needless to say- I was working remotely then, which made the decision to outsource this part of the lab work an easy decision. I researched a lot of outsource options and in the end I went with the cheaper and most recommended option by several colleagues- the Savannah River Ecology Lab at the University of Georgia. In the end I have mixed opinions on their work and please email me if you would like more info and I will detail the ups and downs.
  • Brief Workflow for designing microsatellite markers: 
    1. First! Check the literature to make sure microsatellite markers have not already been developed for your species or a sister species (the latter of which will sometimes work). Using previously developed markers is obviously the easiest and cheapest route!
    2. If the markers have not already been developed: Obtain high quality and high molecular weight DNA Extractions. I love doing 5% Chelex DNA extractions, but the resulting DNA can be full of PCR inhibitors- so I always use the second half of the DNeasy kit to purify and clean up my DNA samples. You can also buy replacement spin columns for these kits way cheaper from Epoch Life Science. Then quantify them on a NanoDrop or a similar DNA quantification instrument and additionally run them on a gel to make sure that you have ≥100 uL of ≥50 ng/uL of >10kb DNA per sample.
    3. Send to a sequencing facility (Illumina with paired ends >150bp preferred)
    4. Clean up sequences/fix Errors and Run a program called “Pal_finder”, or use a similar program. Pal_finder can analyze 454 or paired-end Illumina sequences ( ~150bp from each end).  This program sends possible primers to Primer3 for primer design and searches for how often each primer and primer pair occur.

    5. Filter the resulting data set by only including: a) sequences for which primers can be designed (e.g. enough flanking sequence) and b) primer pairs that occurred 1-3 times. Then, sort by motif length (di, tri, tetra, etc.) to quickly find tri or tetra nucleotide repeats and look to see if the motif was found in both directions of the sequence (which can be bad as they typically end up being smaller PCR products, but this depends on your goals). Finally, order a bunch of the primers that look promising-say 48 primer pairs to start, and test them out on a subset of 24 individuals, with an equal distribution of these individuals across all your study locations, or select individuals that you think will have a lot of variation. See Initial PCR testing in the next Blog. 
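The filtering in step 5 can be done in a few lines of base R once the primer table is loaded. The sketch below uses a small invented table, and the column names (has_primers, pair_occurrences, motif) are hypothetical stand-ins, so they will need to be swapped for whatever columns your Pal_finder output actually contains.

```r
# Hypothetical primer table; real Pal_finder output will have different columns
primers <- data.frame(
  id = paste0("p", 1:6),
  motif = c("AC", "ATG", "AAGT", "TC", "GATA", "CAG"),
  has_primers = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE),
  pair_occurrences = c(1, 2, 7, 1, 3, 1),
  stringsAsFactors = FALSE
)

primers$motif_length <- nchar(primers$motif)

# Keep loci where primers could be designed and the pair occurred 1-3 times
keep <- subset(primers,
               has_primers & pair_occurrences >= 1 & pair_occurrences <= 3)

# Longest motifs first, so tri- and tetranucleotide repeats sort to the top
keep <- keep[order(-keep$motif_length), ]

# Take up to 48 candidate primer pairs for the first round of testing
shortlist <- head(keep, 48)
print(shortlist)
```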

To be continued…

References

Grunwald, N.J., Everhart, S.E., Knaus, B.J., Kamvar, Z.N. 2017. Best Practices for Population Genetic Analyses. Phytopathology 107, 1000-1010.

Hale, M.L., Burg, T.M., Steeves, T.E. 2012. Sampling for microsatellite-based population genetic studies: 25 to 30 individuals per population is enough to accurately estimate allele frequencies. PloS one 7, e45170.

Landguth, E.L., Fedy, B.C., Oyler-McCance, S.J., Garey, A.L., Emel, S.L., Mumma, M., Wagner, H.H., Fortin, M.-J., Cushman, S.A. 2012. Effects of sample size, number of markers, and allelic richness on the detection of spatial genetic pattern. Molecular ecology resources 12, 276-284.

Helpful Resources on Getting Started for Part I

Lecture on Intro to Microsatellites