# Applied Survival Analysis Using R Exercises

Survival data, where the primary outcome is time to a specific event, arise in many areas of biomedical research, including clinical trials, epidemiological studies, and studies of animals. We’re going to be using the built-in lung cancer dataset that ships with the survival package. Applied Survival Analysis Using R covers the main principles of survival analysis, gives examples of how it is applied, and teaches how to put those principles to use to analyze data using R as a vehicle. Create the survival object if you don’t have it yet, and use plot() instead of summary(). You can give the summary() function an option (the times argument) for what times you want to show in the results. You may want to make sure that packages on your local machine are up to date. But it could also be the time until a hardware failure in a mechanical system, time until recovery, time someone remains unemployed after losing a job, time until a ripe tomato is eaten by a grazing deer, time until someone falls asleep in a workshop, etc. The Cancer Genome Atlas (TCGA) is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that collected lots of clinical and genomic data across 33 cancer types. Using R’s survival library, it is possible to conduct very in-depth survival analyses with a huge amount of flexibility and scope. It looks like age is very slightly significant when modeled as a continuous variable. Look at the help for ?colon again. The “KIPAN” cohort (in KIPAN.clinical) is the pan-kidney cohort, consisting of KICH (chromophobe renal cell carcinoma), KIRC (renal clear cell carcinoma), and KIRP (papillary cell carcinoma). The core survival analysis functions are in the survival package.
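
As a minimal sketch of the setup described above, the following loads the survival package and takes a first look at the built-in lung dataset. The object name `lung_df` is just an illustrative choice, not something from the original text.

```r
# Load the survival package (bundled with most R installations)
library(survival)

# Take a copy of the built-in lung cancer dataset under an illustrative name
lung_df <- survival::lung

# Size of the data and the first few rows
dim(lung_df)
head(lung_df)

# Column names; see ?lung for what each variable means
names(lung_df)
```

Running `?lung` afterwards shows the documentation for each column.
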
Please bring your laptop and charger cable to class. 12(3):601-7, 1994. Here, “dead” really refers to the occurrence of the event (any event), not necessarily death. Using the survminer package, plot a Kaplan-Meier curve for this analysis with confidence intervals and showing the p-value. Run a Cox PH regression on the cancer type and gender. Statistics: An Introduction Using R, by M.J. Crawley, Exercises 12. (New in survminer 0.2.4: the survminer package can now determine the optimal cutpoint for one or multiple continuous variables at once, using the surv_cutpoint() and surv_categorize() functions. The response variable you create with Surv() goes on the left hand side of the formula, separated from the predictors by a ~. So, let’s load the package and try it out. The KIPAN.clinical dataset has KICH.clinical, KIRC.clinical, and KIRP.clinical all combined. We’re going to use the survivalTCGA() function from the RTCGA package to pull out survival information from the clinical data. You can operate on it just like any other data frame. You can see more options with the help for ?plot.survfit. But, you’ll need to load it like any other library when you want to use it. So, for a categorical variable like sex, going from male (baseline) to female results in approximately a 40% reduction in hazard. Many survival methods are extensions of techniques used in linear regression and categorical data, while other aspects of this field are unique to survival data. Let’s look at breast cancer, ovarian cancer, and glioblastoma multiforme.
This might be death of a biological organism. We’re not going to go into any more detail here, because there’s another package called survminer that provides a function called ggsurvplot() that makes it much easier to produce publication-ready survival plots, and if you’re familiar with ggplot2 syntax it’s pretty easy to modify. But this doesn’t generalize well for assessing the effect of quantitative variables. Take a look at the built-in colon dataset. You can get this out of the Cox model with a call to summary(fit). What do you think accounted for this increase in our ability to model survival? The data is now housed at the Genomic Data Commons Portal. There are lots of ways to modify the plot produced by base R’s plot() function. Examples are simple and straightforward while still illustrating key points, shedding light on the application of survival analysis in a way that is useful for graduate students, researchers, and practitioners in biostatistics. But you can reorder this if you want with factor(). survfit() creates a survival curve that you could then display or plot. The interpretation of the hazard ratio depends on the measurement scale of the predictor variable, but in simple terms, a positive coefficient indicates worse survival and a negative coefficient indicates better survival for the variable in question. The hazard is the instantaneous event (death) rate at a particular time point t. Survival analysis doesn’t assume the hazard is constant over time. Take a look at the size of the BRCA.mRNA dataset, and show a few rows and columns. You can play fast and loose with how you specify the arguments to Surv(). See ?colon for more information about this dataset. Try creating a survival object called s, then display it. You will learn how to analyze data with a time component and censored data that needs outcome inference. The cumulative hazard is the total hazard experienced up to time t.
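
The exercise above asks for a survival object called `s`. A minimal sketch, assuming the built-in lung data and its `time` and `status` columns, might look like this:

```r
library(survival)
lung <- survival::lung

# Two unnamed arguments are matched to time and event, in that order
s <- Surv(lung$time, lung$status)

# A Surv object is a special vector; censored observations print with a "+"
class(s)
head(s)
```

Displaying `s` in full prints one entry per subject, which makes the censoring pattern easy to scan.
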
The survival function is the probability an individual survives (or, the probability that the event of interest does not occur) up to and including time t. It’s the probability that the event (e.g., death) hasn’t occurred yet. You will learn a few techniques for Time Series Analysis and Survival Analysis. From these tables we can start to see that males tend to have worse survival than females. You could then reassign lung to the as_tibble()-ified version. Focus on survival analysis and RNA-seq data. Click “Chemotherapy for Stage B/C colon cancer”, or be specific with ?survival::colon. Using survfit(Surv(..., ...,)~..., data=colondeath), create a survival curve separately for males versus females. Survival analysis lets you analyze the rates of occurrence of events over time, without assuming the rates are constant. You must complete the setup here prior to class. It provides guidance on how to use SPSS, MATLAB, STATISTICA and R in statistical analysis applications without having to delve into the manuals. When there are so many tools and techniques of prediction modelling, why do we have another field known as survival analysis? Now that we’ve fit a survival curve to the data it’s pretty easy to visualize it with a Kaplan-Meier plot. But, how you make that cut is meaningful! If you type ?colon it’ll ask you if you wanted help on the colon dataset from the survival package, or the colon operator. The alternative lets you specify interval data, where you give it the start and end times (time and time2). See the help for ?expressionsTCGA. Kaplan-Meier curves are good for visualizing differences in survival between two categorical groups, but they don’t work well for assessing the effect of quantitative variables like age, gene expression, leukocyte count, etc. Let’s pull out data for PAX8, GATA-3, and the estrogen receptor genes from breast, ovarian, and endometrial cancer, and plot the expression of each with a box plot.
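
A minimal sketch of fitting survival curves by sex and drawing a base-graphics Kaplan-Meier plot follows. It uses the lung data rather than colondeath so that it is self-contained; the colors and axis labels are illustrative choices.

```r
library(survival)
lung <- survival::lung

# Fit Kaplan-Meier curves separately for each sex (1 = male, 2 = female)
sfit <- survfit(Surv(time, status) ~ sex, data = lung)
sfit

# Base R Kaplan-Meier plot, one step curve per group
plot(sfit, col = c("blue", "red"),
     xlab = "Days", ylab = "Survival probability")
legend("topright", legend = c("Male", "Female"),
       col = c("blue", "red"), lty = 1)
```

The same pattern should carry over to the colon exercise by swapping in `data = colondeath`.
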
The help tells you that when there are two unnamed arguments, they will match time and event in that order. One thing you might see here is an attempt to categorize a continuous variable into different groups (tertiles, upper quartile vs lower quartile, a median split, etc.) so you can make the KM plot. This series of exercises reviews some of the ... epidemiologic scenario taken from Tomas Aragon’s book "Applied Epidemiology Using R". There are lots of ways to access TCGA data without actually downloading and parsing through the data from GDC. The help tells us there are 10 variables in this data. You can access the data just by running lung, as if you had read in a dataset and called it lung. By default it’s going to treat breast cancer as the baseline, because alphabetically it’s first. The curve is horizontal over periods where no event occurs, then drops vertically corresponding to a change in the survival function at each time an event occurs. The core functions we’ll use out of the survival package include Surv(), survfit(), and coxph(). Surv() creates the response variable, and typical usage takes the time to event, and whether or not the event occurred (i.e., death vs censored). The form of the Cox PH model is: $$\log(h(t)) = \log(h_0(t)) + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$ You can get some more information about the dataset by running ?lung.
But, in longitudinal studies where you track samples or subjects from one time point (e.g., entry into a study, diagnosis, start of a treatment) until you observe some outcome event (e.g., death, onset of disease, relapse), it doesn’t make sense to assume the rates are constant. For example, you might want to simultaneously examine the effect of race and socioeconomic status, so as to adjust for factors like income, access to care, etc., before concluding that ethnicity influences some outcome. If you exponentiate both sides of the equation, and limit the right hand side to just a single categorical exposure variable ($x_1$) with two groups ($x_1=1$ for exposed and $x_1=0$ for unexposed), the equation becomes: $$h_1(t) = h_0(t) \times e^{\beta_1 x_1}$$ Which has the worst prognosis? The Kaplan-Meier curve illustrates the survival function. This plot is substantially more informative by default, just because it automatically color codes the different groups, adds axis labels, and creates an automatic legend. You’ll also notice there’s a p-value on the sex term, and a p-value on the overall model. Applied Survival Analysis Using R is a book by Dirk F. Moore. Survival analysis doesn’t assume that the hazard is constant, but does assume that the ratio of hazards between groups is constant over time. This class does not cover methods to deal with non-proportional hazards, or interactions of covariates with the time to event.
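
The color-coded plot with axis labels and a legend described above comes from survminer’s ggsurvplot(). A hedged sketch follows, assuming the survminer package is installed; the guard simply skips the plot if it is not, and the legend labels are an illustrative choice.

```r
library(survival)
lung <- survival::lung

# Kaplan-Meier fit by sex, as input for the plot
sfit <- survfit(Surv(time, status) ~ sex, data = lung)

# survminer is a separate package; only draw the plot if it is available
if (requireNamespace("survminer", quietly = TRUE)) {
  p <- survminer::ggsurvplot(
    sfit, data = lung,
    conf.int = TRUE,      # shaded confidence bands
    pval = TRUE,          # log-rank test p-value on the plot
    risk.table = TRUE,    # number-at-risk table under the curves
    legend.labs = c("Male", "Female")
  )
  print(p)
}
```
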
Cox PH regression models the natural log of the hazard at time t, denoted $h(t)$, as a function of the baseline hazard $h_0(t)$ (the hazard for an individual where all exposure variables are 0) and multiple exposure variables $x_1$, $x_2$, $...$, $x_p$. Do males or females appear to fare better over this time period? Run a summary() on this object, showing time points 0, 500, 1000, 1500, and 2000. This model shows that the hazard ratio is $e^{\beta_1}$, and remains constant over time t (hence the name proportional hazards regression). Is it significant? This class will provide hands-on instruction and exercises covering survival analysis using R. Some of the data to be used here will come from The Cancer Genome Atlas (TCGA), where we may also cover programmatic access to TCGA through Bioconductor if time allows. But there’s a lot more you can do pretty easily here. We’ll cover more of these below. What a mess! The book "Survival Analysis: Techniques for Censored and Truncated Data" by Klein & Moeschberger (2003) is always the first reference I would recommend for people who are interested in learning, practicing and studying survival analysis. A background in basic linear regression and categorical data analysis, as well as a basic knowledge of calculus and the R system, will help the reader to fully appreciate the information presented. This tells us that compared to the baseline BRCA group, GBM patients have a ~18x increase in hazards, and ovarian cancer patients have ~5x worse survival. Let’s get the average age in the dataset, and plot a histogram showing the distribution of age.
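
The statement that the hazard ratio equals $e^{\beta_1}$ can be sketched for a single binary exposure $x_1$. The notation $h(t \mid x_1)$ for the hazard at a given exposure level is introduced here for illustration only.

```latex
\begin{aligned}
h(t \mid x_1 = 1) &= h_0(t)\, e^{\beta_1 \cdot 1} = h_0(t)\, e^{\beta_1} \\
h(t \mid x_1 = 0) &= h_0(t)\, e^{\beta_1 \cdot 0} = h_0(t) \\
\text{Hazard ratio} &= \frac{h(t \mid x_1 = 1)}{h(t \mid x_1 = 0)}
  = \frac{h_0(t)\, e^{\beta_1}}{h_0(t)} = e^{\beta_1}
\end{aligned}
```

The baseline hazard $h_0(t)$ cancels, so the ratio does not depend on $t$. As a purely numeric illustration, a coefficient of $\beta_1 = -0.5$ would give a hazard ratio of $e^{-0.5} \approx 0.61$, i.e. roughly a 39% lower hazard in the exposed group.
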
Look at the help for ?survivalTCGA for more info. Applied Survival Analysis, Second Edition is an ideal book for graduate-level courses in biostatistics, statistics, and epidemiologic methods. It shows the number at risk (number still remaining), and the cumulative survival at that instant. You could see what it looks like as a tibble (prints nicely, tells you the type of variable each column is). Journal of Clinical Oncology. Let’s fit survival curves separately by sex. Use the same command to examine how many samples you have for each kidney sample type, separately by sex. Take a look at some of the other resources shown below. And we can use that sequence vector with a summary call on sfit to get life tables at those intervals separately for both males (1) and females (2). The result is now marginally significant! A great many studies in statistics deal with deaths or with failures of components: the numbers of deaths, the timing of death, and the risks of death to which different classes of individuals are exposed. Survival analysis also goes by other names: reliability theory in engineering, duration analysis in economics, and event history analysis in sociology. This describes the most common type of censoring, right censoring. See the help for ?survfit. That’s because the KM plot is showing the log-rank test p-value. It looks like this: $$S(t) = Pr(T>t)$$ where $T$ is the time of death, and $Pr(T>t)$ is the probability that the time of death is greater than some time $t$.
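
The life tables mentioned above can be sketched as follows. The particular sequence of time points (every 250 days up to 1000) is an illustrative assumption, as are the object names `times` and `life`.

```r
library(survival)
lung <- survival::lung

# Kaplan-Meier fit by sex
sfit <- survfit(Surv(time, status) ~ sex, data = lung)

# A sequence vector of time points at which to report survival
times <- seq(0, 1000, by = 250)

# Life tables at those time points, separately for males (1) and females (2)
life <- summary(sfit, times = times)
life
```

Each block of the printed table shows the number at risk, the number of events, and the estimated survival with its confidence interval at the requested times.
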
Finally, we’ll also want to load the survminer package, which provides much nicer Kaplan-Meier plots out-of-the-box than what you get out of base graphics. In the medical world, we typically think of survival analysis literally: tracking time until death. It’s more interesting to run summary on what it creates. First, let’s turn the colon data into a tibble, then filter the data to only include the survival data, not the recurrence data. The only downside to conducting this analysis in R is that the graphics can look very basic, which, whilst fine for a journal article, does not lend itself too well to presentations and posters. The sample is censored in that you only know that the individual survived up to the loss to followup, but you don’t know anything about survival after that. Now, let’s fit a survival curve with the survfit() function. New examples and exercises at the end of each chapter; analyses throughout the text are performed using Stata® Version 9, and an accompanying FTP site contains the data sets used in the book. In order to assess if this informal finding is reliable, we may perform a log-rank test via ... You can perform updating in R using … R is one of the main tools to perform this sort of analysis thanks to the survival package. Run a Cox proportional hazards regression model against this. In the example, mothers were asked if they would give the presented samples that had been stored for different times to their children. cut() takes a continuous variable and some breakpoints and creates a categorical variable from that. It was then modified for a more extensive training at Memorial Sloan Kettering Cancer Center in March, 2019.
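
The filtering step on the colon data can be sketched with base R. Note this uses subset() rather than the tibble and filter() workflow the text mentions, purely to avoid extra package dependencies; per ?colon, rows with etype equal to 2 are the death (survival) records.

```r
library(survival)
colon <- survival::colon

# Keep only the death records (etype == 2), dropping the recurrence records
colondeath <- subset(colon, etype == 2)

# One row per person should remain
nrow(colondeath)

# Counts by sex (see ?colon for the coding)
table(colondeath$sex)
```
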
See the help for ?Surv. Loprinzi et al. The $\beta$ values are the regression coefficients that are estimated from the model, and represent the $\log(\text{Hazard Ratio})$ for each unit increase in the corresponding predictor variable. If you keep reading you’ll see how Surv() tries to guess how you’re coding the status variable. This course introduces you to additional topics in Machine Learning that complement essential tasks, including forecasting and analyzing censored data. At some point using a categorical grouping for K-M plots breaks down, and further, you might want to assess how multiple variables work together to influence survival. Similarly, we can assign that to another object called sfit (or whatever we wanted to call it). coxph() implements the regression analysis, and models are specified the same way as in regular linear models, but using the coxph() function. Please contact one of the instructors prior to class if you are having difficulty with any of the setup. There are two rows per person, indicated by the event type (etype) variable: etype==1 indicates that the row corresponds to recurrence; etype==2 indicates death.
This text employs numerous actual examples to illustrate survival curve estimation, comparison of survivals of different groups, proper accounting for censoring and truncation, model variable selection, and residual analysis. Because explaining survival analysis requires more advanced mathematics than many other statistical topics, this book is organized with basic concepts and most frequently used procedures covered in earlier chapters, with more advanced topics near the end and in the appendices. This includes installing R, RStudio, and the required packages under the “Survival Analysis” heading. But first, let’s look at an R package that provides convenient, direct access to TCGA data. We could continue adding a labels= option here to label the groupings we create, for instance, as “young” and “old”. It actually has several names. This happens when you track the sample/subject through the end of the study and the event never occurs. Handouts: Download and print out these handouts and bring them to class. In the class on essential statistics we covered basic categorical data analysis: comparing proportions (risks, rates, etc.) between different groups using a chi-square or Fisher’s exact test, or logistic regression. Many of the data sets discussed in the text are available in the accompanying R package “asaur” (for “Applied Survival Analysis Using R”), while others are in other packages. It does this by looking at vital status (dead or alive) and creating a times variable that’s either the days to death or the days followed up before being censored. Let’s just extract the cancer type (admin.disease_code). Show the results using a Kaplan-Meier plot, with confidence intervals and the p-value. Now, that object itself isn’t very interesting. Fit a parametric survival regression model. It’s a special type of vector that tells you both how long the subject was tracked for, and whether or not the event occurred or the sample was censored (shown by the +).
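
The labels= option for cut() mentioned above can be sketched like this. The breakpoint of 62 years and the column name `agecat` are illustrative assumptions, not values taken from the text.

```r
library(survival)
lung <- survival::lung

# Split age at an illustrative breakpoint (62) and label the two groups
lung$agecat <- cut(lung$age,
                   breaks = c(0, 62, Inf),
                   labels = c("young", "old"))

# How many subjects fall in each group
table(lung$agecat)

# The new categorical variable can then be used to fit grouped survival curves
sfit_age <- survfit(Surv(time, status) ~ agecat, data = lung)
sfit_age
```

Moving the breakpoint changes who lands in each group, which is why the choice of cut deserves some thought.
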
We currently use R 2.0.1 patched version. Or, recurrence rate of different cancers varies highly over time, and depends on tumor genetics, treatment, and other environmental factors. What’s the effect of gender? Each of the data packages is a separate package, and must be installed (once) individually. Just try creating a K-M plot for the nodes variable, which has values that range from 0-33. The three earlier courses in this series covered statistical thinking, correlation, linear regression and logistic regression. Proportional hazards regression is also known as Cox regression. Finally, we could assign the result of this to a new object in the lung dataset. RTCGA isn’t the only resource providing easy access to TCGA data. Let’s go back to the lung cancer data and run a Cox regression on sex. Also, the x … Proportional hazards assumption: The main goal of survival analysis is to compare the survival functions in different groups, e.g., leukemia patients as compared to cancer-free controls. Let’s call this new object colondeath. Cox regression is asking which of many categorical or continuous variables significantly affect survival. Surv() can also take start and stop times, to account for left censoring. Let’s create a survival curve, visualize it with a Kaplan-Meier plot, and show a table for the first 5 years’ survival rates.
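
A minimal sketch of the Cox regression on sex for the lung data follows. The object name `fit` is illustrative.

```r
library(survival)
lung <- survival::lung

# Cox proportional hazards regression with sex as the only predictor
fit <- coxph(Surv(time, status) ~ sex, data = lung)

# Coefficient, hazard ratio (exp(coef)), and the overall model tests
summary(fit)

# The hazard ratio on its own: exponentiate the coefficient
exp(coef(fit))
```

A hazard ratio below 1 for sex means the higher-coded group (female, 2) has a lower hazard than the baseline (male, 1).
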
For example: the risk of death after heart surgery is highest immediately post-op, decreases as the patient recovers, then rises slowly again as the patient ages. Call the resulting object sfit. Another way of analysis? The coxph() function uses the same syntax as lm(), glm(), etc. Survival analysis is a subdiscipline of statistics. In some fields it is called event-time analysis, reliability analysis or duration analysis. Notice the test statistic on the likelihood ratio test becomes much larger, and the overall model becomes more significant. The survival package is one of the few “core” packages that comes bundled with your basic R installation, so you probably didn’t need to install.packages() it. The R package needed for this chapter is the survival package. North Central Cancer Treatment Group. Welcome to Survival Analysis in R for Public Health! Remember, you created a colondeath object in the first exercise that only includes survival (etype==2), not recurrence data points. This dataset has survival and recurrence information on 929 people from a clinical trial on colon cancer chemotherapy. In this kind of analysis you implicitly assume that the rates are constant over the period of the study, or as defined by the different groups you defined. Cox regression is the most common approach to assess the effect of different variables on survival. Survival analysis does this by comparing the hazard at different times over the observation period. It will try to guess whether you’re using 0/1 or 1/2 to represent censored vs “dead”, respectively. But, as we saw before, we can’t just do this, because we’ll get a separate curve for every unique value of age! Now, what happens when we make a KM plot with this new categorization?
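
The point about Surv() guessing the status coding can be checked with a short sketch. In the lung data, status is coded 1 (censored) and 2 (dead); the names `s_guess` and `s_explicit` are illustrative.

```r
library(survival)
lung <- survival::lung

# Let Surv() infer the 1/2 coding of the status variable
s_guess <- Surv(lung$time, lung$status)

# Be explicit instead: TRUE means the event (death) occurred
s_explicit <- Surv(lung$time, lung$status == 2)

# Both approaches should encode the same times and event indicators
all(as.matrix(s_guess) == as.matrix(s_explicit))
```

Being explicit costs a few characters and removes any doubt about which value counts as the event.
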
This book not only provides comprehensive discussions of the problems we will face when analyzing time-to-event data, with lots of examples … Interestingly, the Karnofsky performance score as rated by the physician was marginally significant, while the same score as rated by the patient was not. This is the main function we’ll use to create the survival object. How is this different from the lung data? Similar to how survivalTCGA() was a nice helper function to pull out survival information from multiple different clinical datasets, expressionsTCGA() can pull out specific gene expression measurements across different cancer types. Explanatory variables go on the right side. If you followed both groups until everyone died, both survival curves would end at 0%, but one group might have survived on average a lot longer than the other group. Refer to this blog post for more information.) This could also happen due to the sample/subject dropping out of the study for reasons other than death, or some other loss to followup. Prerequisites: Familiarity with R is required (including working with data frames, installing/using packages, importing data, and saving results); familiarity with dplyr and ggplot2 packages is highly recommended.
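
The comparison of physician-rated and patient-rated Karnofsky scores comes from a model with several predictors at once. A hedged sketch follows; the exact set of covariates is an assumption based on the columns of the lung data, and `fit_multi` is an illustrative name.

```r
library(survival)
lung <- survival::lung

# Cox regression with several explanatory variables on the right side
# (covariate set assumed for illustration; rows with missing values are dropped)
fit_multi <- coxph(
  Surv(time, status) ~ sex + age + ph.ecog + ph.karno + pat.karno +
    meal.cal + wt.loss,
  data = lung
)

# One row per predictor, with coefficient, hazard ratio, and p-value
summary(fit_multi)
```

Because rows with missing covariates are dropped by default, the model is fit on fewer subjects than the single-variable models.
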