Fundamentals of Data Analysis in R Programming (R Cheat Sheet Code Included)

anvilicious

Nigel Lowa

Posted on July 3, 2023

The code used here is available on GitHub. Simply click on the 'Open in Colab' button to seamlessly run the code.

Before any data analysis can begin, we first need one or more datasets. R makes the iris dataset available out of the box through its built-in datasets package, while the bodyfat dataset ships with the TH.data package.

#TH.data package is needed for access to the BodyFat dataset
install.packages("TH.data", repos = "http://cran.r-project.org")

After loading the package, we can use str() to examine the number of variables and entries (columns and rows) in the bodyfat dataset.
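A minimal sketch of that inspection, assuming TH.data has already been installed with the install.packages() call above and that the dataset is loaded under its lowercase name, bodyfat:

```r
# Assumes TH.data is already installed (see install.packages() above)
library(TH.data)

# Load the dataset by its lowercase name
data("bodyfat", package = "TH.data")

# Show the number of observations (rows), the number of variables (columns),
# and the type of each column
str(bodyfat)
```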

The iris dataset can be probed in the same way with str().
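Since iris is part of R's built-in datasets package, a sketch of the same check needs no extra installation:

```r
# iris ships with base R, so no package has to be installed
data(iris)

# Show the rows, columns, and column types of the data frame
str(iris)
```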

Data import and export in R

  1. a <- 1:10: This line creates a numeric vector a containing values from 1 to 10. The colon (:) operator is used to generate a sequence of numbers.
  2. save(a, file="./data/dumData.Rdata"): The save() function is used to save the variable a into a file named "dumData.Rdata". The file is saved in the "./data/" directory.
  3. rm(a): The rm() function is used to remove the variable a from the current R session. This means that the variable a is deleted and no longer accessible.
  4. load("./data/dumData.Rdata"): The load() function is used to load the previously saved "dumData.Rdata" file back into the R session. This action reads the file and restores the saved variable a along with its associated values.
  5. print(a): Finally, the print() function is used to display the contents of the variable a. Since a was loaded from the saved file, it contains the values from 1 to 10 that were previously assigned.
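The five steps above can be sketched as follows. The dir.create() call is an addition that is not part of the steps as described; it is included only because save() fails when the target directory does not exist.

```r
# Added safeguard (not in the original steps): make sure ./data exists
dir.create("./data", showWarnings = FALSE)

a <- 1:10                               # vector holding 1, 2, ..., 10
save(a, file = "./data/dumData.Rdata")  # write a to disk
rm(a)                                   # remove a from the session
load("./data/dumData.Rdata")            # restore a from the file
print(a)                                # a holds 1 to 10 again
```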

A dataframe (df1) can be saved as a CSV file with write.csv() and then read back from that file into a new dataframe (df2) with read.csv().
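A hedged sketch of that round trip: the column names, values, and file path below are hypothetical and chosen only for illustration.

```r
# Added safeguard: make sure the target directory exists
dir.create("./data", showWarnings = FALSE)

# Hypothetical example data
df1 <- data.frame(id = 1:3,
                  name = c("a", "b", "c"),
                  score = c(90.5, 82, 77.25))

# row.names = FALSE keeps the row index from being written as an extra column
write.csv(df1, "./data/dummyData.csv", row.names = FALSE)

# Read the file back into a second data frame
df2 <- read.csv("./data/dummyData.csv")
print(df2)
```

Without row.names = FALSE, the file gains an unnamed index column that read.csv() would load back as a column called X.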

Data Exploration

To examine the size and structure of data, you can use various functions in R. Here are examples that showcase the usage of dim(), names(), str(), and attributes():

  1. dim() returns the dimensions (number of rows and columns) of an object, such as a matrix or dataframe.
  2. names() retrieves the names of the variables or columns in an object, such as a dataframe.
  3. str() provides the structure of an object, displaying the data type and overall structure of the variables or columns.
  4. attributes() retrieves the attributes associated with an object, which can include additional information or metadata about the data.
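Applied to the built-in iris data frame, the four functions above can be tried like this (a small sketch, not necessarily the exact calls from the original screenshot):

```r
dim(iris)         # number of rows, then number of columns
names(iris)       # column names
str(iris)         # type and first few values of each column
attributes(iris)  # names, class, and row.names of the data frame
```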

We can also retrieve the first and last rows of the data using the head() and tail() functions.
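As a brief illustration on iris, both functions return six rows by default and accept an n argument to change that:

```r
head(iris)      # first six rows by default
tail(iris)      # last six rows by default
head(iris, 10)  # n controls how many rows are returned
```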

Analyzing a Specific Variable

You can use the summary() function to examine the distribution of each numeric variable in your data. It provides important summary statistics such as the minimum, maximum, mean, median, and quartiles (25% and 75%). Additionally, for factors or categorical variables, summary() displays the frequency of each level or category.
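For example, calling summary() on iris reports the six statistics for each of the four numeric columns and the level counts for the Species factor:

```r
# Numeric columns: min, quartiles, median, mean, max
# Factor column (Species): count per level
summary(iris)
```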

You can obtain the mean, median, and range of a variable using the functions mean(), median(), and range(), respectively. Additionally, if you need to calculate quartiles or percentiles, you can use the quantile() function.
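A short sketch of these calls on the Sepal.Length column of iris; the variable name x and the probabilities passed to quantile() are arbitrary choices for illustration:

```r
x <- iris$Sepal.Length

mean(x)      # arithmetic mean
median(x)    # middle value
range(x)     # minimum and maximum

quantile(x)                      # 0%, 25%, 50%, 75%, 100% by default
quantile(x, c(0.1, 0.3, 0.65))   # arbitrary percentiles
```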

To assess the variance of the Sepal.Length variable, you can use the var() function. Furthermore, you can examine its distribution by creating a histogram and a density plot. The hist() function draws the histogram, while the density() function computes a kernel density estimate that can be passed to plot() to draw the density plot.
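These three steps might look like the following on iris:

```r
var(iris$Sepal.Length)            # sample variance

hist(iris$Sepal.Length)           # histogram of the values

# density() only computes the estimate; plot() draws it
plot(density(iris$Sepal.Length))
```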

The histogram and density plot suggest that the variable is approximately normally distributed, with relatively little spread around the mean.

You can also determine the frequency of factors in a dataset using the table() function. Once you have calculated the frequencies, you can visualize them using either a pie chart created with the pie() function or a bar chart created with the barplot() function.
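As a sketch, the Species factor in iris can be counted and plotted like this (species_counts is just an illustrative variable name):

```r
# Count how many rows fall into each Species level
species_counts <- table(iris$Species)
print(species_counts)

pie(species_counts)      # pie chart of the frequencies
barplot(species_counts)  # bar chart of the same frequencies
```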

Explore Multiple Variables

Once we have examined the distributions of individual variables, the next step is to explore the relationships between pairs of variables. To do this, we can compute the covariance and correlation with the cov() and cor() functions, respectively. Covariance indicates the direction in which two variables vary together, but its magnitude depends on the units of the variables. Correlation rescales the covariance to a value between -1 and 1, so it measures both the strength and the direction of the linear relationship.
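A minimal sketch on iris; the choice of Sepal.Length and Petal.Length as the pair is an assumption made for illustration:

```r
# Covariance and correlation between two numeric columns
cov(iris$Sepal.Length, iris$Petal.Length)
cor(iris$Sepal.Length, iris$Petal.Length)

# Passing all four numeric columns returns a 4 x 4 matrix of pairwise values
cov(iris[, 1:4])
cor(iris[, 1:4])
```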

To visualize the distribution of a variable, we can employ the boxplot() function, which generates a box plot, also known as a box-and-whisker plot. This plot displays key statistical measures, including the median, first quartile (25th percentile), third quartile (75th percentile), and any outliers present.

The median is represented by a horizontal line within the box, while the box itself represents the interquartile range (IQR), indicating the range between the 25th and 75th percentiles. Outliers, if present, are depicted as individual points beyond the whiskers.

Essentially, a box plot provides a concise summary of the central tendency, spread, and presence of outliers in a distribution.
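One way to draw such a plot, assuming we want Sepal.Length split by Species, is the formula interface of boxplot():

```r
# One box per Species level, showing the distribution of Sepal.Length
boxplot(Sepal.Length ~ Species, data = iris)
```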

aggregate() provides a way to generate descriptive statistics (minimum, 1st quartile, median, mean, 3rd quartile, and maximum, respectively) for an existing variable within each group, for instance when summary() is passed as the function to apply.
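A hedged sketch of such a call, assuming Sepal.Length is summarized per Species with summary() as the aggregating function:

```r
# Apply summary() to Sepal.Length separately for each Species level
aggregate(Sepal.Length ~ Species, data = iris, FUN = summary)
```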

To create a scatter plot for two numeric variables in R, you can use the plot() function. By using the with() function, you can avoid the need to explicitly add "iris$" before variable names. Additionally, the colors (col) and symbols (pch) of the data points can be set based on the Species variable.
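A sketch of such a plot; pairing Sepal.Length with Sepal.Width is an assumption made for illustration:

```r
# with() lets us refer to the columns without the iris$ prefix
with(iris, plot(Sepal.Length, Sepal.Width,
                col = Species,               # factor codes 1-3 select the colors
                pch = as.numeric(Species)))  # factor codes 1-3 select the symbols
```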

In situations where there are numerous data points, it is possible for some of them to overlap. To address this issue, we can employ the jitter() function to introduce a slight amount of randomness or noise to the data prior to plotting.
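For instance, jitter() can be wrapped around each variable before plotting. The seed below is a hypothetical addition, used only to make the random noise reproducible:

```r
set.seed(1)  # hypothetical seed, only for reproducibility

# Add a small amount of random noise to both axes to separate overlapping points
plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width))
```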

Bringing it all together

In this section, I have focused mainly on 'zero to minimum viable descriptive analysis' code. In part 2, I will double down on the more complex visualization techniques. However, the code in this article should be enough to generate a comprehensive data analysis report.
