Zoom Logo

TOG RNAseq Workshop (Part I: From Raw to Processed Data) - Shared screen with speaker view
TOG - Amy Inkster
23:33
if you have questions throughout this workshop, please visit this link! https://etherpad.opendev.org/p/TOG_RNAseq_workshop
Adriana Cabrera
25:39
raw sequencing data
Andrew Brown
25:49
FASTQ = sequence + quality
Jiyoung Han
25:51
FASTQ file is type of FASTA file with Quality score
TOG - Amy Inkster
34:38
link to fastq -> bam page, download the “FASTQ_to_BAM.html” object and open in a web browser https://github.com/BCCHR-trainee-omics-group/StudyGroup/tree/master/workshops/RNA-seq-Workshop-2021/data
Aurora Mattison
39:15
Me too!
Wan-Chun (Kathy) Chang
39:18
Same as me
Isaac Rosado Sanchez
39:19
Same here
Rocky
41:47
you can also right-click "Raw" and use the "Save link as"
Che-Min Lee
42:38
Thanks Rocky!
TOG - Amy Inkster
43:47
library(tidyverse)library(here)library(rmarkdown)library(knitr)# formatting packageslibrary(kableExtra)library(janitor)library(scales)library(ggpubr)
TOG - Amy Inkster
45:02
install.packages(“tidyverse”) # if missing any above packages
TOG - Amy Inkster
45:30
library(Rsubread)
TOG - Amy Inkster
47:26
here::here()
TOG - Amy Inkster
48:40
counts <- featureCounts(here::here("data", "HG00097.mapped.ILLUMINA.bwa.GBR.exome.20130415.bam"))
TOG - Amy Inkster
48:50
# above gives 0 counts
TOG - Amy Inkster
49:49
counts <- featureCounts(here::here("data", "HG00097.mapped.ILLUMINA.bwa.GBR.exome.20130415.bam"), annot.inbuilt = "hg19",isPairedEnd = TRUE)
TOG - Amy Inkster
50:47
counts_1 <- counts$counts
Samantha Mar
50:51
we don’t have the bam file though
Sofya Langman
51:23
Yeah, I see a list of files, but non of them are .bam
Adriana Cabrera
51:31
what is the Y axis? sorry
Mahara Mtawali
51:40
Also a little unsure where to get the bam file
Louisa Tambunting
51:49
Me too
Yuyin Yi
51:51
same
TOG - Amy Inkster
52:04
sorry for any confusion, nikita was using this BAM just a demo, we are going to use the eDat.txt and pDat.txt files in a second
Louisa Tambunting
52:14
thank you!
Samantha Mar
52:21
Thanks!
Mahara Mtawali
52:24
thanks
Yuyin Yi
52:26
thanks
Marty Yue
52:37
just curious, how would you download the bam file from the given link?
Marty Yue
53:29
clicking the download icon (in front of the ftp link) doesn't really download anything for me
TOG - Amy Inkster
54:12
eDat <- read.delim(here::here("data", "GSE157103_formatted_eDat.txt"), sep = "\t")
TOG - Amy Inkster
54:20
pDat <- read.delim(here::here("data", "GSE157103_formatted_pDat.txt"), sep = "\t")
TOG - Amy Inkster
55:37
# if you are missing the eDat or pDat files go to the GitHub links (posted in etherpad), then open the file, right click download -> “save link as”
Aurora Mattison
57:50
Will you be able to post this R notebook online afterwards too?
TOG - Amy Inkster
58:13
yes definitely!
TOG - Amy Inkster
58:21
view(as_tibble(pDat))
TOG - William Casazza
58:28
Yes, everything will be available on GitHub
TOG - William Casazza
58:38
@Marty I can answer that over on the etherpad!
Marty Yue
59:26
Thanks! I have been struggling on that
TOG - Amy Inkster
01:00:32
sorry i missed your Q marty

eDat <- eDat %>%column_to_rownames(var = "gene")
Marty Yue
01:00:52
no worries
TOG - William Casazza
01:01:30
If you're comfortable having your question up for everyone to see it, you can also post it on the ether pad: https://etherpad.opendev.org/p/TOG_RNAseq_workshop
TOG - Amy Inkster
01:01:53
all(colnames(eDat) == pDat$ID)
TOG - Amy Inkster
01:02:38
names(pDat)
TOG - Amy Inkster
01:03:18
# count covid patients per sex
pDat %>%dplyr::count(COVID, Sex)
TOG - Amy Inkster
01:05:13
# count covid patients by ICU status yes/no
pDat %>%dplyr::count(COVID, ICU)
TOG - Amy Inkster
01:06:55
pDat %>%dplyr::count(Age)
Divya Anantsri
01:09:37
Would you use data visualization at this point? To plot the Sanity checks?
Divya Anantsri
01:10:03
Got it! Thanks!
TOG - Amy Inkster
01:13:06
pDat <- pDat %>%mutate(age_cat = case_when(Age < 20 ~ "below_20",between(Age, 21, 40) ~ “twentyone_forty”,between(Age, 41, 60) ~ “fortyone_sixty”,Age > 60 ~ "above_60"))
TOG - Amy Inkster
01:13:53
pDat %>% count(age_cat, COVID)
TOG - Amy Inkster
01:15:23
pDat %>%ggplot(aes(x = age_cat, fill = age_cat)) +geom_bar(stat = "count") +facet_wrap(~COVID)
TOG - Amy Inkster
01:16:07
pDat$Age_bracket <- fct_relevel(pDat$age_cat, c("below_20", "21-40", "41-60", "above_60"))
TOG - Amy Inkster
01:16:50
pDat$age_cat <- as.factor(pDat$age_cat)
TOG - Amy Inkster
01:17:23
# correct code: 

pDat$Age_bracket <- fct_relevel(pDat$age_cat, c("below_20", "twentyone_forty", "fortyone_sixty", "above_60"))
TOG - Amy Inkster
01:19:53
# make new variable w protein measurements

proteins <- pDat %>%dplyr::select(COVID, Ferritin_ng.ml, CRP_mg.l, Procalcitonin_ng.ml, Lactate_mmol.l, Fibrinogen_mg.dL)
TOG - Amy Inkster
01:21:48
# select ID in proteins data frame

proteins <- pDat %>% 
 dplyr::select(ID, Ferritin_ng.ml, CRP_mg.l, Procalcitonin_ng.ml, Lactate_mmol.l, Fibrinogen_mg.dL)
TOG - Amy Inkster
01:22:09
# now convert data to long format 

proteins <- proteins %>%pivot_longer(cols = 2:6, names_to = "protein", values_to = "measurement")
TOG - Amy Inkster
01:23:00
# selecting covid in proteins data frame 

proteins <- pDat %>% 
 dplyr::select(COVID, Ferritin_ng.ml, CRP_mg.l, Procalcitonin_ng.ml, Lactate_mmol.l, Fibrinogen_mg.dL)
TOG - Amy Inkster
01:23:10
# pivot longer

proteins <- proteins %>%pivot_longer(cols = 2:6, names_to = "protein", values_to = "measurement")
TOG - Amy Inkster
01:23:33
# plot protein levels by covid status 

proteins %>%ggplot(aes(x = COVID, y = measurement, fill = COVID)) +geom_boxplot() +#geom_jitter(shape = 16, colour = "grey", alpha = 0.5, width = 0.2) +scale_fill_manual(values = c("orange", "grey")) +facet_wrap(~protein, scales = "free") + #scales = free allows the y-axis of each plot to have variable limitstheme_minimal()
TOG - Amy Inkster
01:26:16
library(reshape2)
TOG - Amy Inkster
01:27:06
# “melt” (make longer) our exprs data
# same functionality as pivot_longer

e_melt <- melt(eDat)
TOG - Amy Inkster
01:28:25
e_melt %>%ggplot(aes(x=value, fill = variable)) +geom_density(alpha=0.1)
TOG - Amy Inkster
01:28:56
# remove legend 
e_melt %>%ggplot(aes(x=value, fill = variable)) +geom_density(alpha=0.1) +theme(legend.position = "none")
TOG - Amy Inkster
01:29:13
# take log2 of express data
e_melt %>%ggplot(aes(x=log2(value), fill = variable)) +geom_density(alpha=0.1) +theme(legend.position = "none")
TOG - Amy Inkster
01:31:24
# convert zero values into log2(x+1) e_melt <- e_melt %>%mutate(log_x_1 = log2(value + 1))
TOG - Amy Inkster
01:32:23
# plot the log transformed 

e_melt %>%ggplot(aes(x=log_x_1, fill = variable)) +geom_density(alpha=0.1) +theme(legend.position = "none")
TOG - Amy Inkster
01:34:14
# make new object for correlations

samp_cor <- cor(eDat)
TOG - Amy Inkster
01:35:26
pDat <- pDat %>%column_to_rownames(var="ID")
TOG - Amy Inkster
01:35:39
view(samp_cor)
TOG - Amy Inkster
01:36:19
library(pheatmap)
TOG - Amy Inkster
01:36:27
h1 <- samp_cor %>%pheatmap(clustering_distance_cols = "euclidean", clustering_method = "complete", cluster_rows = TRUE,show_colnames = FALSE, show_rownames = FALSE,annotation_row = pDat[c("COVID", "Sex")], annotation_col = pDat[c("COVID", "Sex")],main = "Sample Correlations")
TOG - Amy Inkster
01:38:06
# same code with nicer formatting

h1 <- samp_cor %>%pheatmap(clustering_distance_cols = "euclidean",clustering_method = "complete",cluster_rows = TRUE,show_colnames = FALSE,show_rownames = FALSE,annotation_row = pDat[c("COVID", "Sex")],annotation_col = pDat[c("COVID", "Sex")],main = "Sample Correlations")
TOG - Amy Inkster
01:39:29
# add custom colors to heatmapannot_cols <- list(COVID = c(`yes` = "grey",`no` = "orange"),Sex = c(`male` = "sea green",`female` = "purple",`unknown` = "yellow"))
TOG - Amy Inkster
01:40:18
# now when we plot the heat map with these new annotation colours

h1 <- samp_cor %>%pheatmap(clustering_distance_cols = "euclidean",clustering_method = "complete",cluster_rows = TRUE,show_colnames = FALSE,show_rownames = FALSE,annotation_row = pDat[c("COVID", "Sex")],annotation_colors = annot_cols,main = "Sample Correlations")
TOG - Amy Inkster
01:41:01
# oops, above was missing column colours

h1 <- samp_cor %>%pheatmap(clustering_distance_cols = "euclidean",clustering_method = "complete",cluster_rows = TRUE,show_colnames = FALSE,show_rownames = FALSE,annotation_row = pDat[c("COVID", "Sex")],annotation_col = pDat[c("COVID", "Sex")],annotation_colors = annot_cols,main = "Sample Correlations")
TOG - Amy Inkster
01:41:42
# to view the plot

h1
TOG - Amy Inkster
01:43:14
dim(eDat)
TOG - Amy Inkster
01:45:16
# remove sequences w 0 count in all samplese_fil <- eDat %>%rownames_to_column(var = "gene") %>%filter_at(vars(-gene), any_vars(. != 0)) %>%column_to_rownames(var = "gene")
TOG - Amy Inkster
01:45:32
# check what we are left with 

dim(e_fil)
TOG - Amy Inkster
01:45:54
# should be 18340 126
TOG - Amy Inkster
01:47:24
# now keep seqs with count at least 1 in all sampse_fil2 <- eDat %>%rownames_to_column(var = "gene") %>%filter_at(vars(-gene), all_vars(. >= 1)) %>%column_to_rownames(var = "gene")
TOG - Amy Inkster
01:47:48
dim(e_fil2)
TOG - Amy Inkster
01:47:55
# should be 1860 126
TOG - Amy Inkster
01:48:37
# now keep seqs with count at least 2 in all sampse_fil3 <- eDat %>%rownames_to_column(var = "gene") %>%filter_at(vars(-gene), all_vars(. >= 2)) %>%column_to_rownames(var = "gene")
TOG - Amy Inkster
01:48:55
dim(e_fil3) # 11384
TOG - Amy Inkster
01:50:08
# to keep seqs w count >=1 in 30% of samps(30*126)/100
TOG - Amy Inkster
01:51:27
# create logical df with all cells that are >=1

RPM_morethan1 <- eDat >= 1
TOG - Amy Inkster
01:52:40
# create a dataframe that sums number of trues per row# if sum>=38 samps, will give TRUEe_fil4 <- as.data.frame(rowSums(RPM_morethan1) >= 38)
TOG - Amy Inkster
01:53:32
#keeping only sequences which have a total of 38 TRUE (i.e. an RPM of >= 1) values or moree_fil4 <- e_fil4 %>%filter(.[1] == "TRUE")
TOG - Amy Inkster
01:54:14
# subset eDat dataframe to only genes present in efilt4 objecte_fil4 <- eDat %>%filter(rownames(eDat) %in% rownames(e_fil4))
TOG - Amy Inkster
01:54:29
# check output 

dim(e_filt4)
TOG - Amy Inkster
01:54:48
# oops typo

dim(e_fil4)
TOG - Amy Inkster
01:56:41
melt_fil2 <- melt(e_fil2)
TOG - Amy Inkster
01:57:16
g1 <- melt_fil2 %>%ggplot(aes(x=log2(value), fill = variable)) +geom_density(alpha=0.1) +theme(legend.position = "none")
TOG - Amy Inkster
01:57:46
# show plot 

g1
TOG - Amy Inkster
01:58:07
# normalization 

library(edgeR)
TOG - Amy Inkster
01:59:37
genes <- as.data.frame(rownames(e_fil2))
TOG - Amy Inkster
02:00:06
# normalize 

norm <- DGEList(counts = e_fil2, samples = pDat, genes = genes, group = pDat$COVID)
TOG - Amy Inkster
02:00:15
eNorm <- calcNormFactors(norm, method = "RLE")eNorm <- cpm(eNorm)eNorm <- as.data.frame(eNorm)
TOG - Amy Inkster
02:02:48
# melt for plotting (make df longer)

melt_norm <- melt(eNorm)
TOG - Amy Inkster
02:04:09
# convert column to row names

eNorm <- eNorm %>%column_to_rownames(var="rownames(e_fil2)")
TOG - Amy Inkster
02:04:22
# then melt

melt_norm <- melt(eNorm)
TOG - Amy Inkster
02:04:49
# plot norm values

norm <- melt_norm %>%ggplot(aes(log2(value), color = variable, fill = variable)) +geom_density(alpha = 0.1) +theme_minimal() +theme(legend.position = "none") 

norm
TOG - Amy Inkster
02:06:07
# set the seed

set.seed(20)
TOG - Amy Inkster
02:07:03
# randomly sample 

random_name <- eDat %>%sample_n(1)
TOG - Amy Inkster
02:07:13
row.names(random_sample)
TOG - Amy Inkster
02:08:13
# sorry typo above,

row.names(random_name)
TOG - Amy Inkster
02:09:28
# pull out row with TRIM62 gene 

sample_from_eNorm <- eNorm %>%rownames_to_column(var = "gene") %>%filter(gene == "TRIM62") %>%column_to_rownames(var = "gene")
TOG - Amy Inkster
02:10:31
sample_from_raw <- eDat %>%rownames_to_column(var = "gene") %>%filter(gene == "TRIM62") %>%column_to_rownames(var = "gene")
TOG - Amy Inkster
02:11:58
row.names(random_sample)[1] <- "TRIM62_raw"row.names(sample_from_eNorm)[1] <- "TRIM62_norm"
TOG - Amy Inkster
02:12:17
# bind the raw and norm rows together

TRIM62 <- rbind(sample_from_raw, sample_from_eNorm)
TOG - Amy Inkster
02:12:54
# now, make this easier to plotTRIM62 <- TRIM62 %>%rownames_to_column(var = "TRIM62_value")
TOG - Amy Inkster
02:13:06
# pivot longer 

TRIM62 <- TRIM62 %>%pivot_longer(cols = -c(TRIM62_value), names_to = "sample", values_to = "RPM")
TOG - Amy Inkster
02:14:51
# now, plot

TRIM62 %>%ggplot(aes(x = sample, y = RPM, colour = TRIM62_value)) +geom_point(size = 2) +scale_colour_manual(values = c("forest green","orange")) +theme_classic() +theme(legend.position = "bottom") +labs(x = "Sample",y = "RPM",title = "Change in TRIM62 expression value before and after normalization",subtitle = "raw vs. RLE Normalized Counts")
TOG - Amy Inkster
02:17:10
# final step, save normalized+filtered exprs dataeNorm <- eNorm %>%rownames_to_column(var = "gene")write_delim(eNorm, file = here::here("data", "eNorm.txt"), delim = "\t")
Martin Adam Prusinkiewicz
02:17:55
Great, thanks
Ali Farrokhi
02:17:56
Thank you
Divya Anantsri
02:18:00
You did a great job!
Mahara Mtawali
02:18:03
Thanks!!
Georgia Whitton
02:18:04
Thanks!
Marina Vineta
02:18:05
Thank you!!!
Adriana Cabrera
02:18:10
Thank you!
Jason Nguyen
02:18:10
thanks
Jiyoung Han
02:18:10
Thank you!
Sofya Langman
02:18:11
thank you!
Kevin Salim
02:18:11
Thanks!
Aurora Mattison
02:18:11
Thank you!!
Divya Anantsri
02:18:14
Thank you!
Yuyin Yi
02:18:18
thank you
TOG - William Casazza
02:18:27
https://github.com/BCCHR-trainee-omics-group/StudyGroup/blob/master/workshops/RNA-seq-Workshop-2021/scripts/TOG_RNAseqWorkshop2021_Part_1.md
TOG - William Casazza
02:18:38
https://github.com/BCCHR-trainee-omics-group/StudyGroup
TOG - William Casazza
02:18:59
https://bcchr.ca/tog
Martin Adam Prusinkiewicz
02:19:23
bye