How To Merge Tables In R

How to merge data in R using R merge, dplyr, or data.table

See how to join two data sets by one or more common columns using base R's merge function, dplyr join functions, and the speedy data.table package.

By Sharon Machlis

Executive Editor, Data & Analytics, InfoWorld


R has a number of quick, elegant ways to join data frames by a common column. I'd like to show you three of them:

base R's merge() function
dplyr's join family of functions
data.table's bracket syntax

Get and import the data

For this example I'll use one of my favorite demo data sets: flight delay times from the U.S. Bureau of Transportation Statistics. If you want to follow along, head to http://bit.ly/USFlightDelays and download data for the time frame of your choice with the columns Flight Date, Reporting_Airline, Origin, Destination, and DepartureDelayMinutes. Also get the lookup table for Reporting_Airline.

Or, you can download these two data sets, plus my R code in a single file and a PowerPoint explaining different types of data merges, from the original InfoWorld article. The download includes R scripts, several data files, and a PowerPoint to accompany the tutorial.

To read in the file with base R, I'd first unzip the flight delay file and then import both the flight delay data and the code lookup file with read.csv(). If you're running the code, the delay file you downloaded will likely have a different name than in the code below. Also, note the lookup file's unusual .csv_ extension.

unzip("673598238_T_ONTIME_REPORTING.zip")
mydf <- read.csv("673598238_T_ONTIME_REPORTING.csv",
                 sep = ",", quote = "\"")
mylookup <- read.csv("L_UNIQUE_CARRIERS.csv_",
                     quote = "\"", sep = ",")

Next, I'll take a peek at both files with head():

head(mydf)
     FL_DATE OP_UNIQUE_CARRIER ORIGIN DEST DEP_DELAY_NEW  X
1 2019-08-01                DL    ATL  DFW            31 NA
2 2019-08-01                DL    DFW  ATL             0 NA
3 2019-08-01                DL    IAH  ATL            40 NA
4 2019-08-01                DL    PDX  SLC             0 NA
5 2019-08-01                DL    SLC  PDX             0 NA
6 2019-08-01                DL    DTW  ATL            10 NA

head(mylookup)
  Code                                          Description
1  02Q                                        Titan Airways
2  04Q                                   Tradewind Aviation
3  05Q                                  Comlux Aviation, AG
4  06Q                        Master Top Linhas Aereas Ltd.
5  07Q                                  Flair Airlines Ltd.
6  09Q Swift Air, LLC d/b/a Eastern Air Lines d/b/a Eastern

Merges with base R

The mydf delay data frame only has airline information by code. I'd like to add a column with the airline names from mylookup. One base R way to do this is with the merge() function, using the basic syntax merge(df1, df2). The order of data frame 1 and data frame 2 doesn't matter, but whichever one is first is considered x and the second one is y.

If the columns you want to join by don't have the same name, you need to tell merge which columns you want to join by: by.x for the x data frame column name, and by.y for the y one, such as merge(df1, df2, by.x = "df1ColName", by.y = "df2ColName").

You can also tell merge whether you want all rows, including ones without a match, or just rows that match, with the arguments all.x and all.y. In this case, I'd like all the rows from the delay data; if there's no airline code in the lookup table, I still want the information. But I don't need rows from the lookup table that aren't in the delay data (there are some codes for former airlines that don't fly anymore in there). So, all.x equals TRUE but all.y equals FALSE. Here's the code:

joined_df <- merge(mydf, mylookup, by.x = "OP_UNIQUE_CARRIER",
                   by.y = "Code", all.x = TRUE, all.y = FALSE)

The new joined data frame includes a column called Description with the name of the airline based on the carrier code:

head(joined_df)
  OP_UNIQUE_CARRIER    FL_DATE ORIGIN DEST DEP_DELAY_NEW  X       Description
1                9E 2019-08-12    JFK  SYR             0 NA Endeavor Air Inc.
2                9E 2019-08-12    TYS  DTW             0 NA Endeavor Air Inc.
3                9E 2019-08-12    ORF  LGA             0 NA Endeavor Air Inc.
4                9E 2019-08-13    IAH  MSP             6 NA Endeavor Air Inc.
5                9E 2019-08-12    DTW  JFK            58 NA Endeavor Air Inc.
6                9E 2019-08-12    SYR  JFK             0 NA Endeavor Air Inc.
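To see what all.x and all.y do without downloading anything, here is a minimal sketch on two made-up data frames. The names flights and carriers, the "ZZ" code, and the delay values are invented for illustration; only merge() and its by.x, by.y, all.x, and all.y arguments come from the discussion above.

```r
# Toy data, invented for illustration (not the BTS files)
flights <- data.frame(
  carrier = c("DL", "9E", "ZZ"),   # "ZZ" has no match in the lookup table
  delay   = c(31, 0, 12),
  stringsAsFactors = FALSE
)
carriers <- data.frame(
  Code        = c("DL", "9E", "02Q"),  # "02Q" never appears in flights
  Description = c("Delta Air Lines Inc.", "Endeavor Air Inc.", "Titan Airways"),
  stringsAsFactors = FALSE
)

# Keep every row of x (flights); drop lookup rows with no matching flight
left <- merge(flights, carriers, by.x = "carrier", by.y = "Code",
              all.x = TRUE, all.y = FALSE)

# Default behavior (all = FALSE): only rows that match in both data frames
inner <- merge(flights, carriers, by.x = "carrier", by.y = "Code")

nrow(left)   # 3: the unmatched "ZZ" row is kept, with Description set to NA
nrow(inner)  # 2: only "DL" and "9E" survive
```

Note that the key column in the result takes its name from the x data frame (carrier here), just as the joined flight data keeps OP_UNIQUE_CARRIER rather than Code.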

Joins with dplyr

The dplyr package uses SQL database syntax for its join functions. A left join means: Include everything on the left (what was the x data frame in merge()) and all rows that match from the right (y) data frame. If the join columns have the same name, all you need is left_join(x, y). If they don't have the same name, you need a by argument, such as left_join(x, y, by = c("df1ColName" = "df2ColName")).

Note the syntax for by: It's a named vector, with both the left and right column names in quotation marks.
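As a small sketch of that named-vector syntax, assuming dplyr is installed, the toy tibbles below (their names and values are invented for illustration) join a carrier column on the left to a Code column on the right:

```r
library(dplyr)

# Toy data, invented for illustration
flights <- tibble(carrier = c("DL", "9E", "ZZ"), delay = c(31, 0, 12))
carriers <- tibble(
  Code        = c("DL", "9E", "02Q"),
  Description = c("Delta Air Lines Inc.", "Endeavor Air Inc.", "Titan Airways")
)

# Left column name goes on the left of "=", right column name on the right
joined <- left_join(flights, carriers, by = c("carrier" = "Code"))

# All three flights rows are kept; "ZZ" has no match, so its Description is NA
joined
```

The result keeps the left table's row order and its key column name, and the unmatched lookup row ("02Q") is dropped.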

Update: The development version of dplyr has an additional by syntax:

left_join(x, y, by = join_by(df1ColName == df2ColName))

Instead of a named vector with quoted column names, the new join_by() function uses unquoted column names and the == boolean operator.

If you'd like to try this out, you can install the dplyr dev version (1.0.99.90 as of this writing) with either

devtools::install_github("tidyverse/dplyr")

or

remotes::install_github("tidyverse/dplyr")

The code to import and merge both data sets using left_join() is below. It starts by loading the dplyr and readr packages, then reads in the two files with read_csv(). When using read_csv(), I don't need to unzip the file first.

library(dplyr)
library(readr)

mytibble <- read_csv("673598238_T_ONTIME_REPORTING.zip")
mylookup_tibble <- read_csv("L_UNIQUE_CARRIERS.csv_")

joined_tibble <- left_join(mytibble, mylookup_tibble,
                           by = c("OP_UNIQUE_CARRIER" = "Code"))

read_csv() creates tibbles, which are a type of data frame with some extra features. left_join() merges the two. Take a look at the syntax: In this case, order matters. left_join() means include all rows on the left, or first, data set, but only rows that match from the second one. And, because I need to join by two differently named columns, I included a by argument.

The new join syntax in the development-only version of dplyr would be:

joined_tibble2 <- left_join(mytibble, mylookup_tibble,
                            by = join_by(OP_UNIQUE_CARRIER == Code))

Since most people likely have the CRAN version, however, I will use dplyr's original named-vector syntax in the rest of this article, until join_by() becomes part of the CRAN version.

We can look at the structure of the result with dplyr's glimpse() function, which is another way to see the top few items of a data frame:

glimpse(joined_tibble)
Observations: 658,461
Variables: 7
$ FL_DATE           <date> 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01…
$ OP_UNIQUE_CARRIER <chr> "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL",…
$ ORIGIN            <chr> "ATL", "DFW", "IAH", "PDX", "SLC", "DTW", "ATL", "MSP", "JF…
$ DEST              <chr> "DFW", "ATL", "ATL", "SLC", "PDX", "ATL", "DTW", "JFK", "MS…
$ DEP_DELAY_NEW     <dbl> 31, 0, 40, 0, 0, 10, 0, 22, 0, 0, 0, 17, 5, 2, 0, 0, 8, 0, …
$ X6                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Description       <chr> "Delta Air Lines Inc.", "Delta Air Lines Inc.", "Delta Air …

This joined data set now has a new column with the name of the airline. If you run a version of this code yourself, you'll probably notice that dplyr is way faster than base R.
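One rough way to check the speed difference on your own machine is to time both approaches with system.time(). The sketch below uses synthetic data rather than the flight files: the row count, the made-up carrier codes, and the object names are all assumptions chosen for illustration, and the timings will vary with hardware and package versions.

```r
library(dplyr)

set.seed(1)
n <- 200000  # arbitrary size, chosen only so the timings are visible

# Synthetic stand-ins for the delay data and the lookup table
codes <- sprintf("C%03d", 1:500)
big <- data.frame(OP_UNIQUE_CARRIER = sample(codes, n, replace = TRUE),
                  DEP_DELAY_NEW = rpois(n, 10),
                  stringsAsFactors = FALSE)
lookup <- data.frame(Code = codes,
                     Description = paste("Airline", codes),
                     stringsAsFactors = FALSE)

# Time the base R merge
base_time <- system.time(
  base_joined <- merge(big, lookup, by.x = "OP_UNIQUE_CARRIER",
                       by.y = "Code", all.x = TRUE, all.y = FALSE)
)

# Time the equivalent dplyr left join
dplyr_time <- system.time(
  dplyr_joined <- left_join(big, lookup, by = c("OP_UNIQUE_CARRIER" = "Code"))
)

# Elapsed seconds for each approach
c(base = base_time[["elapsed"]], dplyr = dplyr_time[["elapsed"]])
```

A single run like this is only indicative; repeating the timing several times gives a steadier comparison.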

Next, let's look at a super-fast way to do joins.

Source: https://www.infoworld.com/article/3454356/how-to-merge-data-in-r-using-r-merge-dplyr-or-datatable.html
