PySpark

How to connect to Snowflake from AWS EMR using PySpark

As ETL developers, we often need to move data between different platforms and services, which involves establishing connections between them. Below is one such use case: connecting to Snowflake from AWS. Here are the steps to securely connect to Snowflake using PySpark.

Log in to the AWS EMR service and start PySpark with the Snowflake connector packages below:

pyspark --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4

The assumption for this article is that a secret has already been created in the AWS Secrets Manager service with the Snowflake credentials. In this example, consider the secret name to be 'test/snowflake/cluster'. Using the boto3 library, connect to AWS Secrets Manager and extract the Snowflake credentials into a JSON object. Sample code snippet below – def ge...
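Since the snippet above is cut off, here is a minimal sketch of what such a helper might look like. It assumes the pyspark shell (where spark is predefined); the function name get_secret, the region, the key names inside the secret, and the table name are illustrative assumptions, not taken from the article:

import json
import boto3

def get_secret(secret_name="test/snowflake/cluster", region_name="us-east-1"):
    # Connect to AWS Secrets Manager and fetch the stored secret string.
    client = boto3.session.Session().client(
        service_name="secretsmanager", region_name=region_name
    )
    response = client.get_secret_value(SecretId=secret_name)
    # The credentials are stored as a JSON string; parse them into a dict.
    return json.loads(response["SecretString"])

# Build spark-snowflake connector options from the secret. The sfURL/sfUser/...
# option names are the connector's standard ones; the key names read from the
# secret (account_url, username, ...) are assumptions about how it was stored.
creds = get_secret()
sf_options = {
    "sfURL": creds["account_url"],
    "sfUser": creds["username"],
    "sfPassword": creds["password"],
    "sfDatabase": creds["database"],
    "sfSchema": creds["schema"],
    "sfWarehouse": creds["warehouse"],
}

df = (spark.read
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("dbtable", "MY_TABLE")   # hypothetical table name
      .load())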

Handy Methods in SparkContext Object while writing Spark Applications

SparkContext is the main entry point for Spark functionality. It is basically a class in the Spark framework that, when initialized, gets access to the Spark libraries. A SparkContext is responsible for connecting to the Spark cluster; it can be used to create RDDs (Resilient Distributed Datasets) and to broadcast variables on that cluster, and it has many more useful methods. To create or initialize a SparkContext, a SparkConf needs to be created beforehand. SparkConf is basically the class used to set configurations for Spark applications, such as the master URL and the application name. Creating a SparkContext –

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn").setAppName("my-app")
sc = SparkContext(conf=conf)

In the latest versions of Spark, sparkContext is available in SparkSession (a class in the Spark SQL component/Main entry...
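The excerpt ends mid-sentence above; as a minimal sketch of a few of the handy methods the title refers to (these all exist on SparkContext; the sample data and app name are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sc-demo").getOrCreate()
sc = spark.sparkContext              # SparkContext obtained from the SparkSession

print(sc.version)                    # Spark version in use
print(sc.applicationId)              # unique ID of this application
print(sc.defaultParallelism)         # default number of partitions

rdd = sc.parallelize([1, 2, 3, 4])   # create an RDD from a local collection
factor = sc.broadcast(10)            # broadcast a read-only variable to the cluster
print(rdd.map(lambda x: x * factor.value).collect())   # [10, 20, 30, 40]

spark.stop()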

PySpark – Components

PySpark core components include –

Spark Core – the base engine; all other functionality is built on top of it. Contains classes like SparkContext and RDD.
Spark SQL – gives an API for structured data processing. Contains important classes like SparkSession and DataFrame (the typed Dataset API is JVM-only and not exposed in PySpark).
Spark Streaming – gives functionality for streaming data processing using the micro-batching technique. Contains classes like StreamingContext and DStream.
Spark ML – provides an API to implement machine learning algorithms.
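As a short sketch of where each component surfaces in PySpark code (the module paths are the standard ones; the master URL, app name, and sample data are placeholders):

from pyspark.sql import SparkSession              # Spark SQL
from pyspark.streaming import StreamingContext    # Spark Streaming
from pyspark.ml.feature import VectorAssembler    # Spark ML

spark = SparkSession.builder.master("local[*]").appName("components").getOrCreate()
sc = spark.sparkContext                           # Spark Core entry point

rdd = sc.parallelize(range(5))                                       # Spark Core: RDD
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])      # Spark SQL: DataFrame
ssc = StreamingContext(sc, batchDuration=10)                         # Spark Streaming: 10s micro-batches
assembler = VectorAssembler(inputCols=["id"], outputCol="features")  # Spark ML transformer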
