An Amazing Semester + New Job!

Happy Holidays everyone! I’m sure you can tell from my stealth mode that I’ve been in the middle of another hectic yet wonderful semester (with no time to blog!). This semester might have been the best yet because I felt like I finally got the rhythm of both of my classes down, as well as the balance of how much work to give students (so that they still learn a lot but aren’t stressed out). Not to mention, we were back to in-person teaching (yippee)! So I had free rein with outdoor field trips for my USC BISC315 Ecology students. (You can read about how I dealt with remote teaching here.)

My amazing TA (Jennifer Beatty) and I definitely took it to the max this semester. To make up for lost time, we had the students conduct ecological field studies at the Ballona Wetlands and the Abalone Cove Tide Pools as well as at USC and the nearby Natural History Museum gardens. By the end of the semester, the students were definitely pros at surveying biodiversity with transects and quadrats as well as with pitfall traps (the former for plant and intertidal organism diversity and the latter to collect ground-dwelling arthropods). We also had a cricket-behavior lab and a parasite lab (where the students collected snails and then dissected them for their trematode parasites!). I had a blast and I know most of the students did too! While teaching them experimental design, Jennifer and I also gave them the flexibility to ask their own scientific questions and to design their own experiments. We also taught them how to collect, analyze and interpret the data using Excel and R. Here is a video of my presentation for the CET Faculty Showcase, where I describe the importance of student ownership when it comes to teaching data analysis:

The semester then concluded with a poster symposium where the students chose their favorite study to focus on and present (photos below). It was tons of fun!

I suppose all of this makes it a bit bittersweet to announce that I will be changing it up and transitioning to a new career (which I’m stoked about, as bittersweet as it is to know that I won’t be teaching Ecology next year).

OK… you are in suspense, I know… Drumroll please…

NEW Job Alert!: I’m going to be a Sustainability Data Analyst for the Office of Sustainability at USC! I could not be more ecstatic to combine my skills in data analytics and visualization with my passion for sustainability. I will definitely write more about my position once I start, but briefly: I will be responsible for collecting, analyzing and visualizing all the different data on the USC campus that relate to USC’s sustainability initiatives. Ultimately these data will be used to evaluate areas where we can improve, including metrics like waste, water and electricity usage, education and engagement, etc. The data will be presented in reports that are assessed through STARS (the Sustainability Tracking, Assessment & Rating System). The ultimate goal is to go from our current silver rating to gold and eventually platinum! This job is awesome because I get to stay at USC and I get to be part of this fabulous sustainability ride. Stay tuned and enjoy your holiday, or even just every day!

Recommended holiday break reading

‘The Power of Now’ by Eckhart Tolle – all about being present in the moment, really helpful during these chaotic and uncertain times.

“Do You Dream of Terra-Two?” by Temi Oh – I would describe it as similar-ish to Harry Potter but with astronauts and space (no magic) – the first fiction I’ve read since college and it was delightful!

Population Genetics, Part III: Data Wrangling and Analyses

So, some good news! My population genetics study on the two herbivorous biological control agents of water hyacinth, Neochetina bruchi and N. eichhorniae, was finally accepted for publication with minor revisions in Evolutionary Applications. I will certainly post it once it is In Press! This was one of the projects I did for my Delta Science Postdoctoral Fellowship research.

So with that, I will fulfill my promise of posting Part III of my ‘how-to’ series for population genetics using microsatellites. To recap, Part I of this series explained what microsatellites are and how to develop microsatellite markers, and Part II covered how to amplify and genotype these markers (the cheap way, with universal fluorescently labeled tails and multiplex PCR).

Part III (right here!) is my attempt to guide you through the jungle of population genetic analyses. I will discuss the main programs and analyses I used and how to properly format your data to make these packages and programs work!

STRUCTURE analysis of N. bruchi across eight populations and eight loci

I am not going to go into nitty-gritty detail because the tutorial for the ‘poppr’ package in R does a FANTASTIC job of guiding newbies (including my former self) through how to import data into R, explore the data, and then conduct some basic and advanced analyses. The link is here: http://grunwaldlab.github.io/Population_Genetics_in_R/index.html

Honestly, this is how I started learning how to conduct population genetic analyses in R. I kid you not. I literally followed the above tutorial step by step and did almost all of the analyses just to get a feel for the data and for how to run population genetic stats.

So, where to start, you ask?

Well, one of my collaborators and coauthors (Dr. Ruth Hufbauer, CSU) emphasized that before you analyze the data, a good first step is to know what your questions are and why you are asking them. Then you should base your analyses on those questions.

Here are some example questions:

  • Where did these samples/individuals originate from?
  • How many populations are there?
  • What is the genetic diversity in these populations, and are some populations more diverse than others? (Genetic diversity is often based on one or more of the following: heterozygosity, allelic richness and diversity indices such as Shannon, Simpson, or Nei.)
  • Are there population genetic bottlenecks?
  • Is there inbreeding?
  • Are there hybrids (crosses between two species)?

Then of course you have to report some general marker- and population-based stats: deviation from Hardy-Weinberg Equilibrium (HWE), Linkage Disequilibrium (LD), overall expected and observed heterozygosity (He and Ho), null alleles, etc.


Load the Data: Before you do anything, you have to load the data in a format that the programs recognize!

  • GenAlEx: an Excel-based program, useful for checking data formatting and for reformatting data for import into R or other programs. However, the main thing I found useful was understanding just what your dataframe should look like, which the Poppr tutorial emphasizes nicely: here
  • Adegenet package in R (Jombart et al., 2010): converts a data frame, matrix or txt file into the format that you need for a specific type of analysis
    • For most of my data analyses, I used the following two formats, converting my csv to data that the packages could recognize, or that I could convert further:
      1. newdataname <- read.genalex("datafile.csv", genclone = FALSE)
        • you can convert this to a genepop format with the following code-
      2. newdataname2 <- read.genalex("datafile.csv")
        • # the gtypes conversion needs a genclone object, so don’t set genclone = FALSE here
          • gtypesdata <- genind2gtypes(newdataname2)   # genind2gtypes() comes from the strataG package

Example dataframe for import and analysis with the Poppr R package. Areas selected in blue represent the Loci, Samples and Populations; see the poppr tutorial for further examples.
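If it helps to see that layout as text rather than as a screenshot, here is a minimal sketch in base R that writes a tiny GenAlEx-style csv. The samples, sites, locus names and file name are all made up for illustration (they are not my weevil data), and the three header rows follow the layout described in the poppr tutorial as I understand it: row 1 holds the number of loci, samples and populations followed by each population's size, row 2 holds a title and the population names, and row 3 holds the column headers, with each diploid locus spanning two columns.

```r
# Toy data, invented for illustration: 4 diploid individuals, 2 sites, 2 loci.
# Each locus takes two columns (one per allele).
geno <- data.frame(
  Ind  = c("A1", "A2", "B1", "B2"),
  Pop  = c("SiteA", "SiteA", "SiteB", "SiteB"),
  L1_a = c(120, 124, 124, 128), L1_b = c(124, 124, 128, 128),
  L2_a = c(200, 200, 204, 204), L2_b = c(200, 204, 204, 208)
)
loci     <- c("L1", "L2")
pop_size <- table(factor(geno$Pop, levels = unique(geno$Pop)))

genalex_lines <- c(
  # Row 1: number of loci, samples, populations, then the size of each population
  paste(c(length(loci), nrow(geno), length(pop_size), pop_size), collapse = ","),
  # Row 2: a title, two empty cells, then the population names
  paste(c("Toy data", "", "", names(pop_size)), collapse = ","),
  # Row 3: column headers; each locus name is followed by an empty cell
  paste(c("Ind", "Pop", as.vector(rbind(loci, ""))), collapse = ","),
  # Remaining rows: one sample per line
  do.call(paste, c(geno, sep = ","))
)

path <- file.path(tempdir(), "toy_genalex.csv")  # hypothetical file name
writeLines(genalex_lines, path)
cat(genalex_lines, sep = "\n")

# If poppr is installed, the file can then be read the same way as above
if (requireNamespace("poppr", quietly = TRUE)) {
  toy <- poppr::read.genalex(path, ploidy = 2, genclone = FALSE)
  print(toy)
}
```

In practice you would build this file in Excel or GenAlEx rather than in code; the point is just to show where each piece of information sits.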

Basic and Advanced Stats- I suggest using:

  • Poppr– (Kamvar et al., 2015; Kamvar et al., 2014) this package depends on loading a lot of other packages, and the tutorial guides you through the analyses. One example: acting as a wrapper for the ‘vegan’ package, poppr calculates a genotype accumulation curve (to see if you sampled enough loci and individuals).
  • Pegas- (Paradis, 2010) calculate Linkage Disequilibrium (LD) and HWE across populations for each locus
  • PopGenReport- (Adamack et al., 2014) calculate null-allele frequencies, pairwise FST and Jost’s D analyses, and compare total and average allelic richness (accounting for sample size) and the number of private alleles among populations
  • diveRsity– (Keenan et al., 2013)-Estimate the average observed (Ho) and expected (He) heterozygosity, deviations from HWE (exact test) and the average ‘inbreeding coefficient’ (FIS) for each population across all loci.  In my paper I distinguish FIS as a measure of increases in homozygosity due to genetic drift caused by a larger population being separated into sub populations, rather than due to consanguineous mating (Crow, 2010)
  • InbreedR- (Stoffel et al., 2016)-calculate g2 as a measure of inbreeding.
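To make those summary stats a little less abstract, here is a small worked example in base R for a single locus. It is only a sketch of the textbook definitions (the packages above handle many loci, populations, missing data and significance tests for you), and the genotypes are invented for illustration.

```r
# Toy diploid genotypes at one microsatellite locus (allele sizes in bp);
# these numbers are invented, not from my dataset.
geno <- data.frame(
  a1 = c(120, 120, 124, 124, 128, 120),
  a2 = c(120, 124, 124, 128, 128, 124)
)
n <- nrow(geno)

# Observed heterozygosity (Ho): proportion of individuals with two different alleles
ho <- mean(geno$a1 != geno$a2)

# Allele frequencies pooled over both allele copies
alleles <- c(geno$a1, geno$a2)
p <- table(alleles) / length(alleles)

# Expected heterozygosity (He) under Hardy-Weinberg: 1 - sum(p_i^2)
he <- 1 - sum(p^2)

# Nei's small-sample correction, often reported as unbiased He
he_unbiased <- he * (2 * n) / (2 * n - 1)

# Inbreeding coefficient: FIS = 1 - Ho / He
fis <- 1 - ho / he

round(c(Ho = ho, He = he, He_unbiased = he_unbiased, FIS = fis), 3)
```

A positive FIS means fewer heterozygotes than expected under HWE, which is the pattern that inbreeding, population subdivision or null alleles can all produce.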

Hypothesis testing: 

  • Linear Mixed Models (LMMs) or Generalized Linear Mixed Models (GLMMs), depending on which is more suitable for your data, with the lmer function (or glmer for GLMMs) in the lme4 package (Bates et al., 2015): I used this to test for the effects of population (collection site) on genetic diversity. Implementing an LMM accounts for the variability of the microsatellite loci by modeling locus as a random effect and collection site as a fixed effect, with allelic richness or expected heterozygosity as the response variable in separate models. Stepwise model simplification (Crawley, 2013) can be performed using likelihood ratio tests. Differences across collection sites can be compared, based on 95% CI, using Tukey’s post-hoc test in the ‘multcomp’ package (Hothorn et al., 2008). Read more about mixed models here.
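Here is roughly what that model setup looks like in code. This is a sketch on simulated data, not the analysis from my paper: the data frame d, its column names and all of the effect sizes are made up, and it assumes you have lme4 installed (the Tukey step only runs if multcomp is installed too).

```r
library(lme4)

set.seed(42)
# Simulated stand-in data: allelic richness for 8 sites x 8 loci
d <- expand.grid(site = paste0("site", 1:8), locus = paste0("locus", 1:8))
site_effect  <- seq(-0.7, 0.7, length.out = 8)  # invented site differences
locus_effect <- rnorm(8, sd = 1)                # invented locus-to-locus variability
d$allelic_richness <- 4 +
  site_effect[as.integer(d$site)] +
  locus_effect[as.integer(d$locus)] +
  rnorm(nrow(d), sd = 0.3)

# Collection site as a fixed effect, locus as a random intercept.
# Fit with ML (REML = FALSE) so models with different fixed effects can be compared.
m_full <- lmer(allelic_richness ~ site + (1 | locus), data = d, REML = FALSE)
m_null <- lmer(allelic_richness ~ 1 + (1 | locus), data = d, REML = FALSE)

# Likelihood ratio test for the effect of site (one model-simplification step)
lrt <- anova(m_null, m_full)
print(lrt)

# Tukey-style pairwise comparisons among sites with 95% CIs
if (requireNamespace("multcomp", quietly = TRUE)) {
  tukey <- multcomp::glht(m_full, linfct = multcomp::mcp(site = "Tukey"))
  print(confint(tukey))
}
```

For expected heterozygosity you would fit the same structure with a different response column; whether a plain LMM or a GLMM is the better fit depends on how your response is distributed.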

Analyses of Population Structure

I suggest using several programs to see how they compare. I used:

  • STRUCTURE, as it is one of the most popular programs (Pritchard et al., 2000). I used Clumpak (Kopelman, Mayzel, Jakobsson, Rosenberg, & Mayrose, 2015) to identify the best K, and to visualize and produce plots based on all of the runs from the STRUCTURE outputs. Please see the data-wrangling section below for more details on how to get your data into STRUCTURE, and also into Clumpak.
  • FLOCK- a great program in Excel (Duchesne & Turgeon, 2012) for seeing which populations are genetic sources for other populations, as well as for determining ‘K’, the number of genetic clusters within a given population or site (useful to compare with the ‘K’ values output by STRUCTURE)
  • ‘adegenet’– to conduct Discriminant Analysis of Principal Components (DAPC) (Jombart, Devillard, & Balloux, 2010). There is a great tutorial here:

DAPC analysis on microsatellite data (eight loci) from eight populations of N. bruchi
Used the Adegenet package in R, and the Adegenet DAPC tutorial

Of course life is never easy… especially when you have a Mac (OS X) and, for some reason, the world revolves around PCs.

Here are some Data-wrangling tips for getting data into STRUCTURE and ClumpaK 

  • To get my data into the STRUCTURE format, I used the function ‘genind2structure’ that I found online here. Then in R, I used: genind2structure(inputdata, file="outputdata.txt", pops=TRUE).
  • Following this, you will need to:
    • DELETE THE GENALEX HEADERS
    • GET RID OF ANY ‘_’ IN THE TEXT FILE
    • GET RID OF LETTERS IN POP FILE, REPLACE WITH #S
    • DELETE IND AND POP HEADER
    • SAVE AS TXT FILE (TAB DELIMITED)
    • RUN THE PERL SCRIPT BELOW
    • Since I have a Mac (OS X), I had to convert the line endings from DOS to UNIX in Terminal before loading the file into STRUCTURE, using code similar to this: while($_ = <>){s/\r\n|\n|\r/\n/g; print $_;}
      You can find more info here.
    • DON’T TOUCH FILE AFTER THIS.. TA DA!
    • To get my files into the Clumpak web processor, I had to use a different zip program (Zipfiles4PC) than the one built into Mac OS X, as for some reason Clumpak couldn’t process Mac-zipped files.
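If you would rather stay in R for the line-ending step, here is a minimal base R sketch that does the same job as the Perl one-liner. The function name normalize_line_endings and the file names are hypothetical (this is not what I actually ran), and it assumes the file is small enough to read into memory in one go.

```r
# Convert DOS (\r\n) and old-Mac (\r) line endings to UNIX (\n).
# normalize_line_endings() is a made-up helper name for this sketch.
normalize_line_endings <- function(infile, outfile) {
  # Read the whole file as a single string, byte for byte
  txt <- readChar(infile, file.info(infile)$size, useBytes = TRUE)
  # Replace \r\n first, then any remaining lone \r
  txt <- gsub("\r\n|\r", "\n", txt)
  # Write it back without R appending anything or translating line endings
  writeChar(txt, outfile, eos = NULL, useBytes = TRUE)
  invisible(outfile)
}

# Small demonstration on a throwaway file with mixed line endings
demo_in  <- file.path(tempdir(), "structure_dos.txt")
demo_out <- file.path(tempdir(), "structure_unix.txt")
writeChar("ind1\t1\t120\t124\r\nind2\t1\t124\t124\rind3\t2\t128\t128\n",
          demo_in, eos = NULL, useBytes = TRUE)
normalize_line_endings(demo_in, demo_out)
result <- readChar(demo_out, file.info(demo_out)$size, useBytes = TRUE)
cat(result)
```

Writing to a new file rather than overwriting the original makes it easy to go back if STRUCTURE still complains about the input.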

OK… I think that is enough for now. But really, if I can emphasize one thing, it is to go through the whole Poppr tutorial to get a handle on how to analyze data in R, and a feel for YOUR data!