Big Data Transfers

Data Pipelines

So you want to be a data engineer or data scientist. Welcome to the club, but now it's time to get hazed or get educated on the ways of one of the single most essential aspects of data work: the transfer of massive datasets. I will take you through some of the techniques in my data arsenal.

We have all heard the incredible tales of engineers slinging big data in the wild-west world of technology we live in today. Data Engineers (I'll call them DEs because I'm keystroke lazy) are the backbone of research, analytics, and any worthy data science team; without DEs, data science doesn't exist. Many analogies about data engineers flood my mind: DEs are like the bass player in a rock band, delivering deep melodic lines with a mysterious complexity that drives a band to greatness or causes its demise. Data Engineers are the Dwight Schrute to the Michael Scott in The Office. Still, alas, so many people want to be the lead singer (data scientist) that they forget the majesty of the most transcendent support role: being a rock-star data engineer. People often ask me what the first step to becoming a data engineer is. My answer is almost always a robust understanding of data transfer techniques. With so many options available, I'll detail the techniques that I deploy on the regular.

Globus, the rich, lazy man's file transfer. The biggest pro is that Globus can easily handle massive files and quickly send them to the specified endpoints. The lousy news is that Globus is NOT cheap. For example, it costs close to $8k annually per endpoint on an enterprise system. If I have an endpoint on my work cluster, that is $8k, and if I have an endpoint dumping into AWS, that is another $8k. The Globus solution can be cost restrictive; however, if you're a midsize or more prominent company, don't kick the can around; use Globus. Globus offers a web interface with simple drag-and-drop features that even the seasoned data engineer can get giddy about. When servers fail, solar flares burst, or some jackwagon trips over the power cord, Globus will pick back up from where it left off. It has onboard capabilities to ensure that every bit and byte is delivered accordingly. Globus is the ultimate set-it-and-forget-it. A bonus to Globus is that personal endpoints are FREE (at least as of the time I'm punching these keys).
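
The web app gets the glory, but Globus also ships a CLI (globus-cli) for scripting transfers. A minimal sketch, assuming you have installed it with pip install globus-cli and authenticated with globus login; the endpoint UUIDs and paths below are placeholders:

# Recursively transfer a directory between two Globus endpoints (placeholder UUIDs/paths)
globus transfer "SRC_ENDPOINT_UUID:/projects/dataset/" "DST_ENDPOINT_UUID:/archive/dataset/" --recursive --label "dataset-backup"

Globus queues the job server-side and handles the retries and integrity checks for you; globus task show TASK_ID reports progress on the submitted transfer.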

If you haven't heard of rclone, then you're missing out. Rclone often beats rsync, scp, lftp, and other command-line tools by a mile. Unlike Globus, it lets us transfer files to many locations, not just sites with endpoints: Google Drive, Dropbox, and other cloud-based solutions all work. Rclone has become my gold standard for data file transfers. It does a fantastic job of ensuring that all the data makes it from the source to the destination by running checksums on both ends.

Additionally, rclone will restart if the connection is lost. Rclone has some neat tricks up its sleeve, including the ability to seamlessly copy/move/sync files between systems like GCP/AWS and tools like Dropbox/Box/Google Drive. This tool isn't for the faint of heart; you need some Linux chops. But after fumbling through the configuration, the transfers are pretty easy.
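
A minimal sketch of that cross-cloud trick, assuming an S3 remote named s3remote and the Google Drive remote we configure below (both names are from my own config):

# Mirror an S3 prefix into Google Drive, 8 parallel transfers, verify by hash
rclone sync s3remote:my-bucket/raw smancusoGoogleDrive:raw --progress --transfers 8 --retries 5 --checksum

The --checksum flag forces hash comparison instead of size/modtime, and --retries reruns the sync if it fails partway, which covers the dropped-connection restarts mentioned above.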

Using rclone locally

On Windows: download the binary from https://rclone.org/downloads/.

On Mac: install rclone with:

curl https://rclone.org/install.sh | sudo bash
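
Either way, confirm the install worked with:

rclone version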

Configuring Rclone for Google Drive

Take the following steps to configure rclone for your Google Drive account. These configuration steps only need to be taken once; after that, you can use your new "remote" as a target for copying data.

  • Log in using your credentials to the source server.
  • Type module load rclone && rclone config (the module load is only needed on clusters that use environment modules). The rclone configuration menu will appear.
  • Select n to create a new remote connection to Google Drive. When prompted, give it a name; in this example, I'll use smancusoGoogleDrive.
  • The next screen will display all the possible connections for the version of rclone installed on the machine. This menu changes between versions, so be sure to select the number corresponding to "Google Drive."

Additionally, you may see "Google Cloud Storage"; this is NOT "Google Drive", so do not select that option.

  • When prompted for the next two options, leave them blank and hit return/enter.

    • Google Application Client Id - leave blank normally.
    • Google Application Client Secret - leave blank normally.
  • Choose "N" at the prompt that says, "Say N if you are working on a remote or headless machine or Y didn't work." Since the Yen cluster is headless, rclone must give us a unique URL to validate the connection.

  • At the next prompt, rclone provides a unique URL that you need to paste into a browser. After logging in with your Google credentials, Google Drive will give you a code to paste back into the terminal window.

  • Finally, to finish the configuration, type "y" and hit enter in the terminal window if you believe everything is properly set up. At the last prompt, hit "q" to quit. You have successfully set up rclone for your Google Drive.
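
If you want to sanity-check the result, rclone config show smancusoGoogleDrive prints the stored stanza. It should look roughly like this (exact fields vary between rclone versions; token redacted here):

[smancusoGoogleDrive]
type = drive
scope = drive
token = {"access_token":"...","expiry":"..."}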

Rclone Common Operations

To list connection points:

rclone listremotes
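
With the remote configured above, the output is simply each remote's name followed by a colon:

smancusoGoogleDrive: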

Create a remote folder on Google Drive using rclone. Note that this will create the folder within your Google Drive base folder:

rclone mkdir smancusoGoogleDrive:GoogleDriveFolderName

To upload the contents of a directory to Google Drive using rclone copy:

rclone copy /Path/To/Folder/ smancusoGoogleDrive:GoogleDriveFolderName/
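
For large directories, I bump the parallelism and watch the transfer as it runs; both are standard rclone flags:

rclone copy /Path/To/Folder/ smancusoGoogleDrive:GoogleDriveFolderName/ --progress --transfers 8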

List the contents of a remote folder on Google Drive:

rclone ls smancusoGoogleDrive:GoogleDriveFolderName

Download from the remote Google Drive:

rclone copy smancusoGoogleDrive:GoogleDriveFolderName /Path/To/Local/Download/Folder
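
Finally, to prove every bit and byte made the trip, rclone check compares checksums between the two sides and reports any mismatches:

rclone check /Path/To/Local/Download/Folder smancusoGoogleDrive:GoogleDriveFolderName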