Cheat Sheet Dplyr



  1. Data Wrangling Dplyr Cheat Sheet
  2. Rstudio Dplyr Cheat Sheet
  3. Dplyr Cheat Sheet Deutsch
  4. Rstudio Dplyr Cheat Sheet

Data Transformation Cheatsheet dplyr provides a grammar for manipulating tables in R. This cheatsheet will guide you through the grammar, reminding you how to select, filter, arrange, mutate, summarise, group, and join data frames and tibbles. (Previous version) Updated January 17. The Data Transformation cheat sheet is a classic example of a straightforward mnemonic tool. It is likely that even someone who just beginning to work with dplyr will immediately grok that it organizes functions that manipulate tidy data. The cognitive load then is to remember how functions are grouped by task. To work with a database in dplyr, you must first connect to it, using DBI::dbConnect. We’re not going to go into the details of the DBI package here, but it’s the foundation upon which dbplyr is built. You’ll need to learn more about if you need to do things to the database that are beyond the scope of dplyr. The tidyverse cheat sheet will guide you through some general information on the tidyverse, and then covers topics such as useful functions, loading in your data, manipulating it with dplyr and lastly, visualize it with ggplot2. In short, everything that you need to kickstart your data science learning with R! Do you want to learn more? Practice Problems using dplyr 5 ) Onto a new example: Calculate the total births in each maternal race/ethnicity group (WnH, AA, WH, AI/AN, Other). For each group, also calculate the prevalence of early care and prevalence of preterm birth.

4 min readRstudio2020/04/16

Motivation

I use R to access data held in Microsoft SQL Server databases on a daily basis. As a result of running into problems, I’ve realized I don’t have an understanding of the roles different components, notably dplyr, dbplyr, odbc and dbi each play in the process. This contributes to the fact that my efforts to resolve or mitigate issues are often an inefficient combination of Google searches and trial and error. Additionally, as I find or develop workarounds, I am unwilling to promote them with others because I don’t fully understand the cause of the issue and, as a result, I am not confident I have addressed the problem at the appropriate level.

Specifically, this is most motivated by the, “Invalid descriptor index,” error documented here.

I am writing what I learn to solidly my thinking and to help others who may experience the same challenge.

Scope

Given that the source of my motivation is encountering problems when working with data in Microsoft SQL Server databases, and that I prefer to use packages in the tidyverse, this investigation will be focused on how these packages work together to collect data from databases managed by Microsoft SQL server.

I won’t be writing about other options like RODBC, RJDBC or database-specific packages.

I will also not include comments about database management systems (DBMS) other than Microsoft SQL Server.

Data wrangling dplyr cheat sheet

While it’s possible to generalize many of the concepts I write about here to other DBMS systems, I will not explicitly call them out. There are plenty of resources that do that. I aim to be very focused on how these components interact in a tidyverse and Microsoft SQL Server environment in the hopes it will help paint a simpler, clearer picture for others working in that same configuration.

Additionally, I almost never write data to a DBMS, and I suspect this is the case for many people working as analysts in Enterprise environments. In light of that I will be focused on how these components work together to extract data from SQL, and not how they write data to it.

Role of dplyr and dbplyr packages

dbplyr is the database back-end for dplyr - it does not need to be loaded explicitly, it is loaded by dplyr when working with data in a database.

dbplyr translates dplyr syntax into Microsoft SQL Server specific SQL code so dplyr can be used to retrieve data from a database system without the need to write SQL code.

dbplyr relies on the DBI and odbc packages as an intermediaries for connections with a SQL Server database.

Dplyr can also pass on explicitly written (not translated from dplyr) SQL code to DBI.

dbplyr generates, or captures, the SQL code that is then passed into the front-end of the database stack provided by DBI, odbc and the ODBC driver

Data Wrangling Dplyr Cheat Sheet

Role of the DBI package

DBI segments the connectivity to the SQL database into a, “front-end,” and a, “back-end.”

DBI implements a standardized front-end to dbplyr, and the odbc package acts as a driver for DBI to interface with SQL Server.

An example of front-end functionality provided by DBI…

  • connect/disconnect to the database
  • create and execute statements in the DBMS
  • extract results/output from statements
  • error/exception handling
  • information (meta-data) from database objects
  • transaction management (optional)

I think of the DBI package as the front end for interactive user at the R console, a script or package, into the other components, odbc and an ODBC driver, that make it possible to extract data from SQL server.

Role of the odbc package

The odbc package provides the DBI back-end to any odbc driver connection, including those for Microsoft SQL Server.

This enables a connection to any database with ODBC drivers available.

I think of the odbc package as the “back-end” of DBI and the “front-end” into the ODBC driver.

Role of ODBC drivers

Open Database Connectivity (ODBC) drivers are the last leg of the link between dplyr and SQL Server. They are what enable the odbc package to interface with SQL server.

I think of the SQL Server ODBC driver as the “front-end” into SQL server, and again, back to the user, script or package

Summary

  • Dbplyr translates dplyr syntax into SQL code, or captures explicitly written SQL, and hands it off to DBI
  • DBI uses odbc to interface with the SQL Server ODBC driver
  • ODBC driver communicates with SQL server and retrieves results

Rstudio Dplyr Cheat Sheet

User or package code -> DBI -> odbc -> SQL Server ODBC driver -> SQL server

Sources and notes

Dplyr Cheat Sheet Deutsch

I haven’t written anything new here, just focused it on the configuration I use day to day in the hopes it helps someone else.

Most was gleaned from the following and I’d recommend reviewing them for a broader perspective, and deeper insights into specific areas:

Rstudio Dplyr Cheat Sheet

  • vignette('DBI', package = 'DBI')
  • vignette('dbplyr', 'package = 'dbplyr)