The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface, so you don’t have to configure your application specially for each one.

(base) ashish@ashish-vBox:/usr/local/spark$ ./bin/spark-submit --master yarn examples/src/main/python/pi.py 100

Here, the number '100' specifies the number of partitions.

--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077).

If it is set to "yarn": Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.

If it is set to "local": Run Spark locally with one worker thread (i.e. no parallelism at all).

Ref: https://spark.apache.org/docs/latest/submitting-applications.html

Logs that you get based on the "partitions" argument (one task is launched per partition, so 100 partitions give tasks 0.0 through 99.0):

2020-02-28 22:14:36,949 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
2020-02-28 22:14:37,772 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
2020-02-28 22:14:37,958 INFO executor.Executor: Running task 2.0 in stage 0.0 (TID 3)
...
2020-02-28 22:14:51,031 INFO executor.Executor: Running task 99.0 in stage 0.0 (TID 99)

IF YOU TRY TO EXECUTE PYSPARK CODE WITH THE "MASTER" ARGUMENT SET TO "yarn" BUT YARN IS NOT INSTALLED (OR ITS RESOURCEMANAGER IS NOT RUNNING), YOU WILL GET THE FOLLOWING ERRORS:

(base) ashish@ashish-vBox:/usr/local/spark$ ./bin/spark-submit --master yarn examples/src/main/python/pi.py 100
2020-02-28 22:45:16,988 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2020-02-28 22:45:18,272 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2020-02-28 22:45:19,281 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...

The port "8032" is the YARN ResourceManager port, which is read from the property "yarn.resourcemanager.address". [Need citation]
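
The retry loop above can be avoided with a small pre-flight check before calling spark-submit. The following is only a rough sketch, not part of Spark: the helper names (resource_manager_reachable, choose_master, build_submit_command) are made up for illustration, and the host "localhost" and port 8032 are assumptions taken from the log output above. Falling back to "local" when nothing answers is also just one possible choice.

```python
import socket


def resource_manager_reachable(host="localhost", port=8032, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Note: an open port only shows that something is listening there;
    it does not prove that the listener is a YARN ResourceManager.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts.
        return False


def choose_master(host="localhost", port=8032):
    """Pick "yarn" when the assumed ResourceManager port answers, else "local"."""
    return "yarn" if resource_manager_reachable(host, port) else "local"


def build_submit_command(master, partitions=100):
    """Build the same spark-submit command line used in this post."""
    return [
        "./bin/spark-submit",
        "--master", master,
        "examples/src/main/python/pi.py",
        str(partitions),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_submit_command(choose_master())))
```

Run from /usr/local/spark, the printed command can be pasted into the shell, or passed to subprocess.run if you want the script to launch the job itself.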
Launching a PySpark program using "spark-submit"