Friday, October 7, 2022

Spark Installation on Windows (2022-Oct-07, Status Failure, Part 1)

$ conda install pyspark -c conda-forge

---------------------------

Test

In Jupyter Lab

import pyspark

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 import pyspark

File ~\Anaconda3\lib\site-packages\pyspark\__init__.py:51, in <module>
     48 import types
     50 from pyspark.conf import SparkConf
---> 51 from pyspark.context import SparkContext
     52 from pyspark.rdd import RDD, RDDBarrier
     53 from pyspark.files import SparkFiles

File ~\Anaconda3\lib\site-packages\pyspark\context.py:31, in <module>
     27 from tempfile import NamedTemporaryFile
     29 from py4j.protocol import Py4JError
---> 31 from pyspark import accumulators
     32 from pyspark.accumulators import Accumulator
     33 from pyspark.broadcast import Broadcast, BroadcastPickleRegistry

File ~\Anaconda3\lib\site-packages\pyspark\accumulators.py:97, in <module>
     95 import socketserver as SocketServer
     96 import threading
---> 97 from pyspark.serializers import read_int, PickleSerializer
    100 __all__ = ['Accumulator', 'AccumulatorParam']
    103 pickleSer = PickleSerializer()

File ~\Anaconda3\lib\site-packages\pyspark\serializers.py:71, in <module>
     68 protocol = 3
     69 xrange = range
---> 71 from pyspark import cloudpickle
     72 from pyspark.util import _exception_message
     75 __all__ = ["PickleSerializer", "MarshalSerializer", "UTF8Deserializer"]

File ~\Anaconda3\lib\site-packages\pyspark\cloudpickle.py:145, in <module>
    125 else:
    126     return types.CodeType(
    127         co.co_argcount,
    128         co.co_kwonlyargcount,
   (...)
    141         (),
    142     )
--> 145 _cell_set_template_code = _make_cell_set_template_code()
    148 def cell_set(cell, value):
    149     """Set the value of a closure cell.
    150     """

File ~\Anaconda3\lib\site-packages\pyspark\cloudpickle.py:126, in _make_cell_set_template_code()
    109     return types.CodeType(
    110         co.co_argcount,
    111         co.co_nlocals,
   (...)
    123         (),
    124     )
    125 else:
--> 126     return types.CodeType(
    127         co.co_argcount,
    128         co.co_kwonlyargcount,
    129         co.co_nlocals,
    130         co.co_stacksize,
    131         co.co_flags,
    132         co.co_code,
    133         co.co_consts,
    134         co.co_names,
    135         co.co_varnames,
    136         co.co_filename,
    137         co.co_name,
    138         co.co_firstlineno,
    139         co.co_lnotab,
    140         co.co_cellvars,  # this is the trickery
    141         (),
    142     )

TypeError: an integer is required (got type bytes)
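This traceback bottoms out in the copy of cloudpickle bundled inside pyspark, which still calls types.CodeType with the pre-Python-3.8 signature. As the uninstall log below shows, the conda-forge solve had actually installed pyspark 2.4.4, and Spark 2.4.x does not support Python 3.8+, which is exactly what produces "TypeError: an integer is required (got type bytes)" on this Python 3.9 base environment. A small sketch to catch the mismatch up front, without importing pyspark:

# Sketch: flag the known Spark 2.x vs Python 3.8+ mismatch without importing pyspark.
import sys
from importlib.metadata import version

spark_version = version("pyspark")
if int(spark_version.split(".")[0]) < 3 and sys.version_info >= (3, 8):
    print(f"pyspark {spark_version} does not support Python "
          f"{sys.version_info.major}.{sys.version_info.minor}; install pyspark >= 3.x")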

In Python CLI

(base) C:\Users\ashish>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\context.py", line 31, in <module>
    from pyspark import accumulators
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\serializers.py", line 71, in <module>
    from pyspark import cloudpickle
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)
>>> exit()

---------------------------

(base) C:\Users\ashish>conda uninstall pyspark
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: C:\Users\ashish\Anaconda3

  removed specs:
    - pyspark

The following packages will be REMOVED:

  py4j-0.10.7-py_1
  pyspark-2.4.4-py_0
  python_abi-3.9-2_cp39

The following packages will be SUPERSEDED by a higher-priority channel:

  conda  conda-forge::conda-22.9.0-py39hcbf530~ --> pkgs/main::conda-22.9.0-py39haa95532_0

None
Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

---------------------------

Re-installing via pip3

(base) C:\Users\ashish>pip3 install pyspark
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
     |████████████████████████████████| 281.3 MB 70 kB/s
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     |████████████████████████████████| 199 kB 285 kB/s
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764040 sha256=0684a408679e5a3890611f7025e07d22688f5142c0fe6b90ca535e805b9ae007
  Stored in directory: c:\users\ashish\appdata\local\pip\cache\wheels\05\75\73\81f84d174299abca38dd6a06a5b98b08ae25fce50ab8986fa1
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0

Testing via a Top-Level Import and Printing the Version

(base) C:\Users\ashish>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> pyspark.__version__
'3.3.0'

Further Testing Through Code

In Jupyter Lab

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext  # Main entry point for DataFrame and SQL functionality.

df = pd.DataFrame({
    "col1": ["val1"],
    "col2": ["val2"]
})

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Input In [35], in <cell line: 1>()
----> 1 sdf.show()

File ~\Anaconda3\lib\site-packages\pyspark\sql\dataframe.py:606, in DataFrame.show(self, n, truncate, vertical)
    603     raise TypeError("Parameter 'vertical' must be a bool")
    605 if isinstance(truncate, bool) and truncate:
--> 606     print(self._jdf.showString(n, 20, vertical))
    607 else:
    608     try:

File ~\Anaconda3\lib\site-packages\py4j\java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File ~\Anaconda3\lib\site-packages\pyspark\sql\utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
    188 def deco(*a: Any, **kw: Any) -> Any:
    189     try:
--> 190         return f(*a, **kw)
    191     except Py4JJavaError as e:
    192         converted = convert_exception(e.java_exception)

File ~\Anaconda3\lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o477.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 
29 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863) at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856) at org.apache.spark.sql.Dataset.head(Dataset.scala:2863) at org.apache.spark.sql.Dataset.take(Dataset.scala:3084) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288) at org.apache.spark.sql.Dataset.showString(Dataset.scala:327) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Unknown Source) Caused by: org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ... 1 more Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more
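The root failure in this trace is "org.apache.spark.SparkException: Python worker failed to connect back", which on Windows usually means the JVM could not launch a Python worker process that it can talk to. A commonly suggested workaround is to point Spark explicitly at the interpreter running the notebook by setting PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON before any SparkContext exists (i.e. after a kernel restart). A minimal sketch, assuming the notebook's own interpreter (sys.executable) is the one Spark should use; it also switches to SparkSession, which is what the FutureWarning in the CLI run below recommends over SQLContext:

import os
import sys

import pandas as pd
from pyspark.sql import SparkSession

# Point both the driver and the workers at the interpreter running this notebook.
# These variables must be set before the SparkContext/SparkSession is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = SparkSession.builder.master("local[*]").appName("pyspark-smoke-test").getOrCreate()

df = pd.DataFrame({"col1": ["val1"], "col2": ["val2"]})
sdf = spark.createDataFrame(df)
sdf.show()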

In Python CLI

(base) C:\Users\ashish\Desktop\20221004\code>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 15:42:09 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 15:42:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\sql\context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  warnings.warn(
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 15:43:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by:
java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more 22/10/07 15:43:14 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 
29 more 22/10/07 15:43:14 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job Traceback (most recent call last): (0 + 0) / 1] File "<stdin>", line 1, in <module> File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\sql\dataframe.py", line 606, in show print(self._jdf.showString(n, 20, vertical)) File "C:\Users\ashish\Anaconda3\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__ return_value = get_return_value( File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\sql\utils.py", line 190, in deco return f(*a, **kw) File "C:\Users\ashish\Anaconda3\lib\site-packages\py4j\protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at 
java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863) at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856) at org.apache.spark.sql.Dataset.head(Dataset.scala:2863) at org.apache.spark.sql.Dataset.take(Dataset.scala:3084) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288) at org.apache.spark.sql.Dataset.showString(Dataset.scala:327) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Unknown Source) Caused by: org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ... 1 more Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more >>> ---------------------------
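Two hints buried in this CLI run are worth isolating. First, the warning "WARN Shell: Did not find winutils.exe ... HADOOP_HOME and hadoop.home.dir are unset" means the Hadoop helper binaries for Windows are missing. Second, the line "Python was not found; run without arguments to install from the Microsoft Store" is printed by the Windows App Execution Alias stub, which suggests Spark is starting its worker with a bare python/python3 command that resolves to the Store stub rather than the Anaconda interpreter. A small diagnostic sketch for both checks (the C:\hadoop path is only an assumed example location for a winutils.exe download, not something this setup already has):

# Sketch: check for the two Windows-specific problems hinted at in the log above.
import os
import shutil

# 1. Does a bare "python"/"python3" resolve to a real interpreter or to the Store alias stub?
for name in ("python", "python3"):
    path = shutil.which(name)
    print(f"{name!r} resolves to: {path}")
    if path and "WindowsApps" in path:
        print("  -> Microsoft Store alias stub, not a real interpreter")

# 2. Is HADOOP_HOME set, and does it contain bin\winutils.exe?
hadoop_home = os.environ.get("HADOOP_HOME", r"C:\hadoop")  # assumed example path
print("HADOOP_HOME:", os.environ.get("HADOOP_HOME"))
print("winutils.exe present:", os.path.exists(os.path.join(hadoop_home, "bin", "winutils.exe")))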

Fresh Installation Using an env.yml File

(base) C:\Users\ashish\Desktop>conda env create -f env.yml Collecting package metadata (repodata.json): done Solving environment: done Downloading and Extracting Packages smart_open-6.2.0 | 44 KB | ### | 100% executing-1.1.0 | 23 KB | ### | 100% libthrift-0.15.0 | 865 KB | ### | 100% jsonschema-4.16.0 | 65 KB | ### | 100% aws-c-http-0.6.6 | 154 KB | ### | 100% pandas-1.5.0 | 11.7 MB | ### | 100% nbclient-0.7.0 | 65 KB | ### | 100% liblapack-3.9.0 | 5.6 MB | ### | 100% setuptools-65.4.1 | 776 KB | ### | 100% cffi-1.15.1 | 225 KB | ### | 100% aws-c-event-stream-0 | 47 KB | ### | 100% jupyter_console-6.4. | 23 KB | ### | 100% aws-c-s3-0.1.27 | 49 KB | ### | 100% matplotlib-3.6.0 | 8 KB | ### | 100% libpng-1.6.38 | 773 KB | ### | 100% libdeflate-1.14 | 73 KB | ### | 100% gst-plugins-base-1.2 | 2.4 MB | ### | 100% parquet-cpp-1.5.1 | 3 KB | ### | 100% scikit-learn-1.1.2 | 7.5 MB | ### | 100% libutf8proc-2.7.0 | 101 KB | ### | 100% pkgutil-resolve-name | 9 KB | ### | 100% pyspark-3.3.0 | 268.1 MB | ### | 100% kiwisolver-1.4.4 | 61 KB | ### | 100% importlib_resources- | 28 KB | ### | 100% sip-6.6.2 | 523 KB | ### | 100% gensim-4.2.0 | 22.5 MB | ### | 100% ipykernel-6.16.0 | 100 KB | ### | 100% certifi-2022.9.24 | 155 KB | ### | 100% nbformat-5.6.1 | 106 KB | ### | 100% libblas-3.9.0 | 5.6 MB | ### | 100% gflags-2.2.2 | 80 KB | ### | 100% aws-c-mqtt-0.7.8 | 66 KB | ### | 100% jupyter_client-7.3.5 | 91 KB | ### | 100% ipython-8.5.0 | 553 KB | ### | 100% pywin32-303 | 6.8 MB | ### | 100% qtconsole-5.3.2 | 6 KB | ### | 100% notebook-6.4.12 | 6.3 MB | ### | 100% matplotlib-base-3.6. | 7.5 MB | ### | 100% pyarrow-6.0.0 | 2.4 MB | ### | 100% soupsieve-2.3.2.post | 34 KB | ### | 100% glog-0.5.0 | 90 KB | ### | 100% asttokens-2.0.8 | 24 KB | ### | 100% stack_data-0.5.1 | 24 KB | ### | 100% tbb-2021.6.0 | 174 KB | ### | 100% libiconv-1.17 | 698 KB | ### | 100% re2-2021.11.01 | 476 KB | ### | 100% pillow-9.2.0 | 45.4 MB | ### | 100% scipy-1.9.1 | 28.2 MB | ### | 100% grpc-cpp-1.41.1 | 17.6 MB | ### | 100% joblib-1.2.0 | 205 KB | ### | 100% qt-main-5.15.6 | 68.8 MB | ### | 100% aws-crt-cpp-0.17.1 | 191 KB | ### | 100% tqdm-4.64.1 | 82 KB | ### | 100% regex-2022.9.13 | 350 KB | ### | 100% aws-sdk-cpp-1.9.120 | 5.5 MB | ### | 100% python-fastjsonschem | 242 KB | ### | 100% nbconvert-pandoc-7.2 | 5 KB | ### | 100% vs2015_runtime-14.29 | 1.2 MB | ### | 100% libglib-2.74.0 | 3.1 MB | ### | 100% gettext-0.19.8.1 | 4.7 MB | ### | 100% numpy-1.23.3 | 6.3 MB | ### | 100% nbconvert-7.2.1 | 6 KB | ### | 100% jupyter_core-4.11.1 | 106 KB | ### | 100% pywinpty-2.0.8 | 229 KB | ### | 100% aws-c-common-0.6.11 | 165 KB | ### | 100% python-3.10.6 | 16.5 MB | ### | 100% aws-c-auth-0.6.4 | 91 KB | ### | 100% nest-asyncio-1.5.6 | 10 KB | ### | 100% matplotlib-inline-0. 
| 12 KB | ### | 100% tzdata-2022d | 118 KB | ### | 100% libtiff-4.4.0 | 1.1 MB | ### | 100% libprotobuf-3.18.1 | 2.3 MB | ### | 100% aws-c-io-0.10.9 | 127 KB | ### | 100% libssh2-1.10.0 | 228 KB | ### | 100% debugpy-1.6.3 | 3.2 MB | ### | 100% unicodedata2-14.0.0 | 491 KB | ### | 100% contourpy-1.0.5 | 176 KB | ### | 100% terminado-0.16.0 | 19 KB | ### | 100% pcre2-10.37 | 942 KB | ### | 100% pandoc-2.19.2 | 18.9 MB | ### | 100% prompt-toolkit-3.0.3 | 254 KB | ### | 100% glib-2.74.0 | 452 KB | ### | 100% importlib-metadata-4 | 34 KB | ### | 100% pyrsistent-0.18.1 | 86 KB | ### | 100% mistune-2.0.4 | 67 KB | ### | 100% libcblas-3.9.0 | 5.6 MB | ### | 100% xz-5.2.6 | 213 KB | ### | 100% argon2-cffi-bindings | 34 KB | ### | 100% jupyter-1.0.0 | 7 KB | ### | 100% aws-c-cal-0.5.12 | 36 KB | ### | 100% zlib-1.2.12 | 114 KB | ### | 100% libzlib-1.2.12 | 71 KB | ### | 100% ipywidgets-8.0.2 | 109 KB | ### | 100% traitlets-5.4.0 | 85 KB | ### | 100% jupyterlab_widgets-3 | 222 KB | ### | 100% libsqlite-3.39.4 | 642 KB | ### | 100% jinja2-3.1.2 | 99 KB | ### | 100% zstd-1.5.2 | 401 KB | ### | 100% ca-certificates-2022 | 189 KB | ### | 100% nbconvert-core-7.2.1 | 189 KB | ### | 100% pyzmq-24.0.1 | 461 KB | ### | 100% psutil-5.9.2 | 370 KB | ### | 100% click-8.1.3 | 149 KB | ### | 100% pip-22.2.2 | 1.5 MB | ### | 100% libcurl-7.85.0 | 311 KB | ### | 100% vc-14.2 | 14 KB | ### | 100% markupsafe-2.1.1 | 25 KB | ### | 100% pyqt-5.15.7 | 4.7 MB | ### | 100% arrow-cpp-6.0.0 | 15.7 MB | ### | 100% prompt_toolkit-3.0.3 | 5 KB | ### | 100% pygments-2.13.0 | 821 KB | ### | 100% bleach-5.0.1 | 124 KB | ### | 100% jedi-0.18.1 | 799 KB | ### | 100% py4j-0.10.9.5 | 181 KB | ### | 100% aws-c-compression-0. | 20 KB | ### | 100% gstreamer-1.20.3 | 2.2 MB | ### | 100% tornado-6.2 | 666 KB | ### | 100% statsmodels-0.13.2 | 10.4 MB | ### | 100% fonttools-4.37.4 | 1.7 MB | ### | 100% pyqt5-sip-12.11.0 | 82 KB | ### | 100% widgetsnbextension-4 | 1.6 MB | ### | 100% attrs-22.1.0 | 48 KB | ### | 100% pytz-2022.4 | 232 KB | ### | 100% qtconsole-base-5.3.2 | 91 KB | ### | 100% qtpy-2.2.1 | 49 KB | ### | 100% glib-tools-2.74.0 | 168 KB | ### | 100% aws-checksums-0.1.12 | 51 KB | ### | 100% Preparing transaction: done Verifying transaction: done Executing transaction: done Installing pip dependencies: | Ran pip subprocess with arguments: ['C:\\Users\\ashish\\Anaconda3\\envs\\mh\\python.exe', '-m', 'pip', 'install', '-U', '-r', 'C:\\Users\\ashish\\Desktop\\condaenv.ug1b_vpf.requirements.txt'] Pip subprocess output: Collecting rpy2==3.4.5 Using cached rpy2-3.4.5.tar.gz (194 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'done' Requirement already satisfied: cffi>=1.10.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (1.15.1) Requirement already satisfied: jinja2 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (3.1.2) Requirement already satisfied: pytz in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (2022.4) Collecting tzlocal Using cached tzlocal-4.2-py3-none-any.whl (19 kB) Requirement already satisfied: pycparser in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from cffi>=1.10.0->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (2.21) Requirement 
already satisfied: MarkupSafe>=2.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from jinja2->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (2.1.1) Collecting tzdata Using cached tzdata-2022.4-py2.py3-none-any.whl (336 kB) Collecting pytz-deprecation-shim Using cached pytz_deprecation_shim-0.1.0.post0-py2.py3-none-any.whl (15 kB) Building wheels for collected packages: rpy2 Building wheel for rpy2 (setup.py): started Building wheel for rpy2 (setup.py): finished with status 'done' Created wheel for rpy2: filename=rpy2-3.4.5-py3-none-any.whl size=198842 sha256=9b472eb2c0a65535eac19151a43e5d6fc3dbe5930a3953d76fcb2b170c8106ee Stored in directory: c:\users\ashish\appdata\local\pip\cache\wheels\ba\d8\8b\68fc240578a71188d0ca04b6fe8a58053fbcbcfbe2a3cbad12 Successfully built rpy2 Installing collected packages: tzdata, pytz-deprecation-shim, tzlocal, rpy2 Successfully installed pytz-deprecation-shim-0.1.0.post0 rpy2-3.4.5 tzdata-2022.4 tzlocal-4.2 done # # To activate this environment, use # # $ conda activate mh # # To deactivate an active environment, use # # $ conda deactivate Retrieving notices: ...working... done (base) C:\Users\ashish\Desktop> (base) C:\Users\ashish\Desktop> (base) C:\Users\ashish\Desktop>conda activate mh (mh) C:\Users\ashish\Desktop>python -m ipykernel install --user --name mh Installed kernelspec mh in C:\Users\ashish\AppData\Roaming\jupyter\kernels\mh
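After registering the mh kernel, a quick sanity check run with the new environment's interpreter (or on the registered kernel) confirms that it really is the env's Python and the pyspark build installed there that will be used; a minimal sketch:

# Sketch: confirm the interpreter and pyspark build inside the new "mh" environment.
import sys
import pyspark

print("Interpreter:", sys.executable)       # should point inside ...\Anaconda3\envs\mh
print("PySpark    :", pyspark.__version__)  # the env above resolved pyspark 3.3.0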

Same Error in the New Environment

(base) C:\Users\ashish\Desktop\20221004\code>conda activate mh

(mh) C:\Users\ashish\Desktop\20221004\code>python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:30:19) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 16:30:36 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 16:30:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  warnings.warn(
>>> sdf = sqlCtx.createDataFrame(df)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\pandas\conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\pandas\conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():
>>>
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 16:32:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more 22/10/07 16:32:04 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back. 
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more 22/10/07 16:32:04 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\dataframe.py", line 606, in show print(self._jdf.showString(n, 20, vertical)) File "C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__ return_value = get_return_value( File "C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\utils.py", line 190, in deco return f(*a, **kw) File "C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\py4j\protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 
29 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863) at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856) at org.apache.spark.sql.Dataset.head(Dataset.scala:2863) at org.apache.spark.sql.Dataset.take(Dataset.scala:3084) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288) at org.apache.spark.sql.Dataset.showString(Dataset.scala:327) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Unknown Source) Caused by: org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ... 1 more Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more >>>
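Since a freshly created environment reproduces the same "Python worker failed to connect back" failure, the remaining suspects are the interpreter handed to the worker processes and the missing winutils.exe, not the package set itself. Besides the PYSPARK_PYTHON environment variables shown earlier, the worker interpreter can also be pinned per session through Spark configuration; a sketch, assuming the default Anaconda layout for the mh environment's python.exe:

from pyspark.sql import SparkSession

# Assumed path: default Anaconda envs layout for the "mh" environment created above.
MH_PYTHON = r"C:\Users\ashish\Anaconda3\envs\mh\python.exe"

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("worker-interpreter-test")
    # Tells executors which Python to launch workers with (the driver is already this interpreter).
    .config("spark.pyspark.python", MH_PYTHON)
    .getOrCreate()
)

spark.createDataFrame([("val1", "val2")], ["col1", "col2"]).show()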
Tags: Technology,Spark,

Wednesday, October 5, 2022

Social Analysis (SOAN) based on Whatsapp data (Project Setup)

This project is being developed by: Maarten Grootendorst
Project Link: SOAN: GitHub    

In this post, we share a minimal YAML file to set up an environment for running the SOAN project.


name: soan
channels:
  - conda-forge
dependencies:
  - pandas
  - seaborn
  - scikit-learn
  - matplotlib
  - emoji
  - pattern
  - wordcloud
  - palettable
  - ipykernel
  - jupyter
    


Run Logs

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop$ conda env create -f env.yml Collecting package metadata (repodata.json): done Solving environment: done Downloading and Extracting Packages contourpy-1.0.5 | 234 KB | ### | 100% idna-3.4 | 55 KB | ### | 100% libiconv-1.17 | 1.4 MB | ### | 100% stack_data-0.5.1 | 24 KB | ### | 100% pyopenssl-22.0.0 | 120 KB | ### | 100% scipy-1.9.1 | 26.2 MB | ### | 100% jack-1.9.18 | 647 KB | ### | 100% mysqlclient-2.0.3 | 82 KB | ### | 100% mysql-libs-8.0.30 | 1.9 MB | ### | 100% fontconfig-2.14.0 | 318 KB | ### | 100% glib-tools-2.74.0 | 108 KB | ### | 100% fonttools-4.37.4 | 2.0 MB | ### | 100% qtpy-2.2.1 | 49 KB | ### | 100% python-fastjsonschem | 242 KB | ### | 100% libcap-2.65 | 97 KB | ### | 100% ca-certificates-2022 | 150 KB | ### | 100% gstreamer-1.20.3 | 2.0 MB | ### | 100% jaraco.context-4.1.2 | 9 KB | ### | 100% repoze.lru-0.7 | 13 KB | ### | 100% matplotlib-3.6.0 | 7 KB | ### | 100% jaraco.classes-3.2.2 | 10 KB | ### | 100% glib-2.74.0 | 438 KB | ### | 100% qtconsole-5.3.2 | 6 KB | ### | 100% qtconsole-base-5.3.2 | 91 KB | ### | 100% executing-1.1.0 | 23 KB | ### | 100% pyzmq-24.0.1 | 510 KB | ### | 100% nest-asyncio-1.5.6 | 10 KB | ### | 100% qt-main-5.15.6 | 61.5 MB | ### | 100% fftw-3.3.10 | 2.2 MB | ### | 100% sqlite-3.39.4 | 789 KB | ### | 100% libzlib-1.2.12 | 65 KB | ### | 100% routes-2.5.1 | 35 KB | ### | 100% portaudio-19.6.0 | 132 KB | ### | 100% regex-2022.9.13 | 386 KB | ### | 100% zc.lockfile-2.0 | 12 KB | ### | 100% nbconvert-pandoc-7.1 | 5 KB | ### | 100% cherrypy-18.8.0 | 504 KB | ### | 100% widgetsnbextension-4 | 1.6 MB | ### | 100% jaraco.collections-3 | 14 KB | ### | 100% pdfminer.six-2022052 | 4.9 MB | ### | 100% jupyterlab_widgets-3 | 222 KB | ### | 100% pcre2-10.37 | 1.1 MB | ### | 100% setuptools-65.4.1 | 776 KB | ### | 100% pytz-2022.4 | 232 KB | ### | 100% certifi-2022.9.24 | 155 KB | ### | 100% nbconvert-core-7.1.0 | 189 KB | ### | 100% expat-2.4.9 | 189 KB | ### | 100% libdeflate-1.14 | 81 KB | ### | 100% tzdata-2022d | 118 KB | ### | 100% numpy-1.23.3 | 7.0 MB | ### | 100% ipywidgets-8.0.2 | 109 KB | ### | 100% future-0.18.2 | 738 KB | ### | 100% terminado-0.16.0 | 18 KB | ### | 100% gettext-0.19.8.1 | 3.6 MB | ### | 100% simplejson-3.17.6 | 104 KB | ### | 100% nbformat-5.6.1 | 106 KB | ### | 100% nbconvert-7.1.0 | 6 KB | ### | 100% mysql-common-8.0.30 | 1.9 MB | ### | 100% pulseaudio-14.0 | 1.7 MB | ### | 100% libsqlite-3.39.4 | 803 KB | ### | 100% cryptography-38.0.1 | 1.6 MB | ### | 100% libglib-2.74.0 | 3.1 MB | ### | 100% alsa-lib-1.2.7.2 | 581 KB | ### | 100% jaraco.text-3.9.1 | 21 KB | ### | 100% ipykernel-6.16.0 | 100 KB | ### | 100% gst-plugins-base-1.2 | 2.8 MB | ### | 100% prompt_toolkit-3.0.3 | 5 KB | ### | 100% libtiff-4.4.0 | 651 KB | ### | 100% libxml2-2.10.2 | 727 KB | ### | 100% joblib-1.2.0 | 205 KB | ### | 100% matplotlib-base-3.6. | 7.5 MB | ### | 100% wordcloud-1.8.2.2 | 184 KB | ### | 100% Preparing transaction: done Verifying transaction: done Executing transaction: done # # To activate this environment, use # # $ conda activate soan # # To deactivate an active environment, use # # $ conda deactivate Retrieving notices: ...working... done

Set up the Jupyter Lab Kernel

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop$ conda activate soan (soan) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop$ python -m ipykernel install --user --name soan Installed kernelspec soan in /home/ashish/.local/share/jupyter/kernels/soan (soan) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop$ jupyter kernelspec list Available kernels: mongodb /home/ashish/.local/share/jupyter/kernels/mongodb rasa_py38 /home/ashish/.local/share/jupyter/kernels/rasa_py38 soan /home/ashish/.local/share/jupyter/kernels/soan stock_market_prediction /home/ashish/.local/share/jupyter/kernels/stock_market_prediction python3 /home/ashish/anaconda3/envs/soan/share/jupyter/kernels/python3
Tags: Technology,Natural Language Processing,

Social Analysis (SOAN using Python 3) Report

This post (Dated: 21 Apr 2020) is in continuation from this post Social Analysis (SOAN) of WhatsApp Chats (an NLP and Pandas application)
    
The following information could be deciphered from the chat history:

1: Top 20 users based on message count
2: Top 20 Users based on Word Count
3: Line plots for "weekly number of messages by the users" (here, only the top 5 users are shown)
4: Weekly Number of Messages
5: Growth in Total Number of Messages on Weekly Basis
6: Number of messages a user has sent on each week day (Sunday: 326, Monday: 162, Thursday: 264, Friday: 348, Saturday: 405, Tuesday: 534, Wednesday: 157)
7: Line plot for messages sent in each hour
8: Number of messages sent by any user in each hour for the overall period of time [(0, 98), (1, 10), (2, 4), (3, 2), (8, 15), (9, 63), (10, 91), (11, 79), (12, 163), (13, 207), (14, 81), (15, 90), (16, 114), (17, 93), (18, 78), (19, 106), (20, 148), (21, 233), (22, 292), (23, 229)]
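As an illustration of how the first statistic (top users by message count) could be derived, here is a minimal sketch. It is not the SOAN code itself; the file name and the line format "date, time - sender: message" are assumptions about a typical WhatsApp export.

import re
from collections import Counter

# Matches lines like "12/10/19, 9:41 PM - Some User: message text" (assumed format)
pattern = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, [^-]+ - ([^:]+): ")

counts = Counter()
with open("WhatsApp Chat.txt", encoding="utf-8") as f:   # hypothetical export file
    for line in f:
        match = pattern.match(line)
        if match:
            counts[match.group(1)] += 1               # count one message per sender

for user, n in counts.most_common(20):                # top 20 users by message count
    print(user, n)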
Tags: Technology,Natural Language Processing,

Social Analysis (SOAN) of WhatsApp Chats

As of Jan 2020, the SOAN code requires Python 2.7.
We would soon have to migrate this code to Python 3.x because Python 2.7 reached the end of its life on January 1, 2020.

STEP 1: We will create a Python 2.7.16 environment and a kernel for Anaconda as shown in this link "https://survival8.blogspot.com/p/installing-new-kernel-in-jupyter.html"

STEP 2: Install the necessary Python packages.

For this, I am generating a requirements.txt file using the "pip freeze" command in my already set up environment, as shown below:

(py_2716) D:\>pip freeze > requirements.txt 

On the machine that you need to set up, run the following command from the directory containing the "requirements.txt" file.

(py_2716) D:\>pip install -r requirements.txt 

A "pip freeze" of my py_2716 environment shows following packages:

backports-abc==0.5
backports.csv==1.0.7
backports.functools-lru-cache==1.5
backports.shutil-get-terminal-size==1.0.0
beautifulsoup4==4.7.1
certifi==2019.6.16
chardet==3.0.4
cheroot==6.5.5
CherryPy==17.4.2
colorama==0.4.1
contextlib2==0.5.5
cycler==0.10.0
decorator==4.4.0
emoji==0.5.2
enum34==1.1.6
feedparser==5.2.1
future==0.17.1
futures==3.3.0
idna==2.8
ipykernel==4.10.0
ipython==5.8.0
ipython-genutils==0.2.0
jaraco.functools==2.0
jupyter-client==5.3.1
jupyter-core==4.5.0
kiwisolver==1.1.0
lxml==4.3.4
matplotlib==2.2.4
more-itertools==5.0.0
mysqlclient==1.4.2
nltk==3.4.4
numpy==1.16.4
palettable==3.2.0
pandas==0.24.2
pathlib2==2.3.4
Pattern==3.6
pdfminer==20140328
pickleshare==0.7.5
Pillow==6.1.0
portend==2.5
prompt-toolkit==1.0.16
Pygments==2.4.2
pyparsing==2.4.0
python-dateutil==2.8.0
python-docx==0.8.10
pytz==2019.1
pywin32==224
pyzmq==18.0.2
regex==2019.6.8
requests==2.22.0
scandir==1.10.0
scikit-learn==0.20.3
scipy==1.2.2
seaborn==0.9.0
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.12.0
sklearn==0.0
soupsieve==1.9.2
tempora==1.14.1
tornado==5.1.1
traitlets==4.3.2
urllib3==1.25.3
wcwidth==0.1.7
win-unicode-console==0.5
wincertstore==0.2
wordcloud==1.5.0
zc.lockfile==1.4

STEP 3: How did we generate the data?

We have exported a particular group chat for this demo. This was done via the email option in the WhatsApp app. The exported file is a text file "WhatsApp Chat with Cousins (201910-201912).txt".

STEP 4:
The entire code is present on this Google drive link: https://drive.google.com/open?id=1r8oY5BhAlsG0womnWH5awQMdLqSKQebO
Tags: Technology,Natural Language Processing,

Python Packages Useful For Natural Language Processing

1. nltk

The Natural Language Toolkit (NLTK) is a Python package for natural language processing. NLTK requires Python 3.7, 3.8, 3.9 or 3.10. As in:
1.1. from nltk.sentiment.vader import SentimentIntensityAnalyzer
1.2. from nltk.stem import WordNetLemmatizer
1.3. from nltk.corpus import stopwords
1.4. from nltk.tokenize import word_tokenize
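A minimal sketch tying these imports together; the sample sentence and the nltk.download() calls (needed once to fetch the corpora and models) are our additions, not part of the original post.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

for resource in ["vader_lexicon", "punkt", "stopwords", "wordnet", "omw-1.4"]:
    nltk.download(resource, quiet=True)   # one-time downloads

text = "NLTK makes basic NLP tasks quite approachable."
tokens = [t for t in word_tokenize(text.lower()) if t.isalpha()]
tokens = [t for t in tokens if t not in stopwords.words("english")]
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]
print(lemmas)
print(SentimentIntensityAnalyzer().polarity_scores(text))   # VADER sentiment scores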

2. scikit-learn

One of its most popular usages in NLP:
2.1. from sklearn.feature_extraction.text import TfidfVectorizer
2.2. from sklearn.metrics.pairwise import cosine_similarity
2.3. from sklearn.manifold import TSNE
2.4. from sklearn.decomposition import LatentDirichletAllocation, PCA
2.5. from sklearn.cluster import AgglomerativeClustering
Ref: Math with words
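A minimal sketch of the TF-IDF plus cosine-similarity combination from the list above; the three toy documents are our assumption.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock prices fell sharply today",
]
tfidf = TfidfVectorizer().fit_transform(docs)   # sparse matrix, one row per document
print(cosine_similarity(tfidf[0], tfidf))       # similarity of doc 0 to every document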

3. spaCy: Industrial-strength NLP

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pretrained pipelines and currently supports tokenization and training for 60+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management. spaCy is commercial open-source software, released under the MIT license. Our use case involved NER capability of spaCy: Ref: % Exploring Word2Vec % Python Code to Create Annotations For SpaCy NER
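A minimal NER sketch along the lines of our use case; it assumes the small English model has already been installed with "python -m spacy download en_core_web_sm".

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, U.K. GPE, $1 billion MONEY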

4. gensim

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community. Ref: PyPI
Our use case involved:
4.1. from gensim.models import Word2Vec
4.2. from gensim.corpora.dictionary import Dictionary
4.3. from gensim.models.lsimodel import LsiModel, stochastic_svd
4.4. from gensim.models.coherencemodel import CoherenceModel
4.5. from gensim.models.ldamodel import LdaModel  # Latent Dirichlet Allocation and not 'Latent Discriminant Analysis'
4.6. from gensim.models import RpModel
4.7. from gensim.matutils import corpus2dense, Dense2Corpus
4.8. from gensim.test.utils import common_texts
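A minimal Word2Vec sketch using gensim's bundled toy corpus; note this is the gensim 4.x API, where the embedding size parameter is vector_size (it was called size in gensim 3.x).

from gensim.models import Word2Vec
from gensim.test.utils import common_texts

model = Word2Vec(sentences=common_texts, vector_size=50, window=5, min_count=1)
print(model.wv["computer"][:5])                   # first few dimensions of a word vector
print(model.wv.most_similar("computer", topn=3))  # nearest words in the toy corpus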

5. word2vec

Python interface to Google word2vec. Training is done using the original C code; other functionality is pure Python with numpy.

6. GloVe

Cython general implementation of the GloVe multi-threaded training. GloVe is an unsupervised learning algorithm for generating vector representations for words. Training is done using a co-occurrence matrix from a corpus. The resulting representations contain structure useful for many other tasks. The paper describing the model is [here]. The original implementation for this Machine Learning model can be [found here].

7. fastText

fastText is a library for efficient learning of word representations and sentence classification. Ref: % PyPI % Reasoning with Word Vectors
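A minimal unsupervised-training sketch for the fasttext package; 'corpus.txt' is a hypothetical plain-text file with one piece of text per line, and the hyperparameters are illustrative only.

import fasttext

model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)
print(model.get_word_vector("learning")[:5])        # subword-aware word vector
print(model.get_nearest_neighbors("learning", k=3)) # closest words in the trained space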

8. TextWiser: Text Featurization Library

TextWiser (AAAI'21) is a research library that provides a unified framework for text featurization based on a rich set of methods while taking advantage of pretrained models provided by the state-of-the-art libraries. The main contributions include: Rich Set of Embeddings: A wide range of available embeddings and transformations to choose from. Fine-Tuning: Designed to support a PyTorch backend, and hence, retains the ability to fine-tune featurizations for downstream tasks. That means, if you pass the resulting fine-tunable embeddings to a training method, the features will be optimized automatically for your application. Parameter Optimization: Interoperable with the standard scikit-learn pipeline for hyper-parameter tuning and rapid experimentation. All underlying parameters are exposed to the user. Grammar of Embeddings: Introduces a novel approach to design embeddings from components. The compound embedding allows forming arbitrarily complex embeddings in accordance with a context-free grammar that defines a formal language for valid text featurization. GPU Native: Built with GPUs in mind. If it detects available hardware, the relevant models are automatically placed on the GPU. TextWiser is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments. Documentation is available at fidelity.github.io/textwiser. Here is the video of the paper presentation at AAAI 2021. Our Usecase Involved: Document Embeddings (Doc2Vec): Supported by gensim % Defaults to training from scratch Ref: PyPI

9. BERT-As-a-Service

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`
Ref: % Getting Started with BERT-As-a-Service % Word Embeddings Using BERT (Demo of BERT-As-a-Service)
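A minimal client-side sketch; it assumes a bert-serving-start server is already running locally with a downloaded BERT model, which is an assumption on our part.

from bert_serving.client import BertClient

bc = BertClient()                                    # connects to localhost by default
vectors = bc.encode(["First sentence.", "Second sentence."])
print(vectors.shape)                                 # (2, 768) for BERT-Base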

10. transformers

State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow: Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. These models can be applied on: - Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages. - Images, for tasks like image classification, object detection, and segmentation. - Audio, for tasks like speech recognition and audio classification. Transformer models can also perform tasks on several modalities combined, such as table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering. Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other. Ref: % Word Embeddings using BERT and testing using Word Analogies, Nearest Words, 1D Spectrum % PyPI % conda-forge
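A minimal sketch using the high-level pipeline API; the first call downloads a default pretrained sentiment model from the model hub, and the sample sentence is our own.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes state-of-the-art NLP surprisingly accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]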

11. torch

PyTorch is a Python package that provides two high-level features: % Tensor computation (like NumPy) with strong GPU acceleration % Deep neural networks built on a tape-based autograd system You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch when needed. Ref: PyPI
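A minimal sketch of the two features mentioned above, tensor computation and autograd-based gradients; the tensor values are random.

import torch

x = torch.randn(3, requires_grad=True)   # move to GPU with .to("cuda") if available
y = (x ** 2).sum()
y.backward()                             # autograd fills x.grad with dy/dx = 2x
print(x.grad)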

12. sentence-transformers

Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co. This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in a vector space such that similar text is close and can efficiently be found using cosine similarity. We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases. Further, this framework allows an easy fine-tuning of custom embeddings models, to achieve maximal performance on your specific task. For the full documentation, see www.SBERT.net. The following publications are integrated in this framework: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019) Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (EMNLP 2020) Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks (NAACL 2021) The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (arXiv 2020) TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (arXiv 2021) BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv 2021) Ref: % PyPI % conda-forge
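A minimal sketch; 'all-MiniLM-L6-v2' is a commonly used pretrained model name (downloaded on first use) and the paraphrase pair is our assumption.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["How old are you?", "What is your age?"])
cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(cos)   # close to 1 for paraphrases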

13. Scrapy

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors. Check the Scrapy homepage at https://scrapy.org for more information, including a list of features. Requirements: % Python 3.6+ % Works on Linux, Windows, macOS, BSD
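A minimal spider sketch; quotes.toscrape.com is a public practice site commonly used in Scrapy tutorials. One way to run it: scrapy runspider quotes_spider.py -o quotes.json

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote block into a structured item
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }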

14. Rasa

% Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more % Create chatbots and voice assistants Ref: Rasa

15. Sentiment Analysis using BERT, DistilBERT and ALBERT

Sentiment analysis neural network trained by fine-tuning BERT, ALBERT, or DistilBERT on the Stanford Sentiment Treebank. Ref: % barissayil/SentimentAnalysis % Sentiment Analysis Using BERT

16. pyLDAvis

Python library for interactive topic model visualization. This is a port of the fabulous R package by Carson Sievert and Kenny Shirley. pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization. The visualization is intended to be used within an IPython notebook but can also be saved to a stand-alone HTML file for easy sharing. Note: LDA stands for latent Dirichlet allocation. Ref: % PyPI % Creating taxonomy for BBC news articles
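A minimal sketch combining a gensim LDA model with pyLDAvis, using gensim's toy corpus; note the visualisation module name differs by version (pyLDAvis.gensim_models in 3.x, pyLDAvis.gensim in older releases), and the number of topics is arbitrary here.

from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.test.utils import common_texts
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

dictionary = Dictionary(common_texts)
corpus = [dictionary.doc2bow(text) for text in common_texts]
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3)

vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")   # stand-alone HTML file for sharing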

17. scipy

As in, for cosine distance calculation: from scipy.spatial.distance import cosine
Note:
from sklearn.metrics.pairwise import cosine_similarity  # Expects 2D arrays as input
from scipy.spatial.distance import cosine  # Works with 1D vectors
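A small sketch of the 1D-vs-2D distinction noted above; the vectors are toy values.

import numpy as np
from scipy.spatial.distance import cosine                 # distance, expects 1D vectors
from sklearn.metrics.pairwise import cosine_similarity    # similarity, expects 2D arrays

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(1 - cosine(a, b))                                   # 1 - distance = similarity = 1.0
print(cosine_similarity(a.reshape(1, -1), b.reshape(1, -1)))  # [[1.]]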

18. twitter

The Minimalist Twitter API for Python is a Python API for Twitter, everyone's favorite Web 2.0 Facebook-style status updater for people on the go. Also included is a Twitter command-line tool for getting your friends' tweets and setting your own tweet from the safety and security of your favorite shell and an IRC bot that can announce Twitter updates to an IRC channel. Ref: % https://pypi.org/project/twitter/ % https://survival8.blogspot.com/2022/09/using-twitter-api-to-fetch-trending.html

19. spark-nlp

Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 11000+ pretrained pipelines and models in more than 200+ languages. It also offers tasks such as Tokenization, Word Segmentation, Part-of-Speech Tagging, Word and Sentence Embeddings, Named Entity Recognition, Dependency Parsing, Spell Checking, Text Classification, Sentiment Analysis, Token Classification, Machine Translation (+180 languages), Summarization, Question Answering, Table Question Answering, Text Generation, Image Classification, Automatic Speech Recognition, and many more NLP tasks. Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, CamemBERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, MarianMT, GPT2, and Vision Transformers (ViT) not only to Python and R, but also to JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively. Ref: https://pypi.org/project/spark-nlp/

20. keras-transformer

Popular Usage: Machine Translation Ref: # PyPI # GitHub

21. pronouncing

Pronouncing is a simple interface for the CMU Pronouncing Dictionary. It’s easy to use and has no external dependencies. For example, here’s how to find rhymes for a given word: >>> import pronouncing >>> pronouncing.rhymes("climbing") ['diming', 'liming', 'priming', 'rhyming', 'timing'] Ref: https://pypi.org/project/pronouncing/

22. random-word

This is a simple python package to generate random English words.

23. langdetect

Port of Nakatani Shuyo's language-detection library (version from 03/03/2014) to Python. langdetect supports 55 languages out of the box (ISO 639-1 codes): af, ar, bg, bn, ca, cs, cy, da, de, el, en (English), es, et, fa, fi, fr, gu, he, hi (Hindi), hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw Ref: https://pypi.org/project/langdetect/
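A minimal sketch; pinning DetectorFactory.seed makes the otherwise non-deterministic results repeatable, and the sample sentences are our own.

from langdetect import detect, detect_langs, DetectorFactory

DetectorFactory.seed = 0
print(detect("This is a short English sentence."))   # 'en'
print(detect_langs("यह एक छोटा हिंदी वाक्य है।"))        # e.g. [hi:0.99...]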

24. PyPDF2

PyPDF2 is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. PyPDF2 can retrieve text and metadata from PDFs as well.
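A minimal sketch using the PdfReader/PdfWriter API of PyPDF2 2.x (older releases used PdfFileReader/PdfFileWriter); 'input.pdf' is a hypothetical file.

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
print(len(reader.pages), reader.metadata)      # page count and document metadata
print(reader.pages[0].extract_text())          # text of the first page

writer = PdfWriter()
writer.add_page(reader.pages[0])               # keep only the first page
with open("first_page.pdf", "wb") as f:
    writer.write(f)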

25. python-docx

python-docx is a Python library for creating and updating Microsoft Word (.docx) files. Installation: $ pip install python-docx Usage: import docx Note: Does not support .pdf and .doc Ref: % github % Convert MS Word files into PDF format
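A minimal sketch of creating and reading back a .docx file; the file name and content are illustrative.

from docx import Document

doc = Document()
doc.add_heading("Monthly Report", level=1)
doc.add_paragraph("python-docx can create and update Word documents.")
doc.save("report.docx")

for para in Document("report.docx").paragraphs:   # read the file back
    print(para.text)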

26. emoji

Emoji for Python. This project was inspired by kyokomi. The entire set of Emoji codes as defined by the Unicode consortium is supported in addition to a bunch of aliases. By default, only the official list is enabled but doing emoji.emojize(language='alias') enables both the full list and aliases. Ref: % PyPI % conda-forge % Social Analysis (SOAN using Python 3) Report
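A minimal sketch of emojize/demojize, including the alias behaviour noted above.

import emoji

print(emoji.emojize("Python is :thumbs_up:"))                    # official name
print(emoji.emojize("Python is :thumbsup:", language="alias"))   # alias form
print(emoji.demojize("Python is 👍"))                            # back to :thumbs_up: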

27. pattern

Web mining module for Python. Ref: % PyPI % conda-forge

28. wordcloud

A little word cloud generator in Python. Read more about it on the blog post or the website. The code is tested against Python 2.7, 3.4, 3.5, 3.6 and 3.7. [Dated: 20221005]

Installation with pip3

$ pip3 install wordcloud Collecting wordcloud Downloading wordcloud-1.8.2.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (458 kB) |████████████████████████████████| 458 kB 13 kB/s Requirement already satisfied: numpy>=1.6.1 in /home/ashish/anaconda3/lib/python3.9/site-packages (from wordcloud) (1.21.5) Requirement already satisfied: matplotlib in /home/ashish/anaconda3/lib/python3.9/site-packages (from wordcloud) (3.5.1) Requirement already satisfied: pillow in /home/ashish/anaconda3/lib/python3.9/site-packages (from wordcloud) (9.0.1) Requirement already satisfied: cycler>=0.10 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (0.11.0) Requirement already satisfied: fonttools>=4.22.0 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (4.25.0) Requirement already satisfied: packaging>=20.0 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (21.3) Requirement already satisfied: python-dateutil>=2.7 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (2.8.2) Requirement already satisfied: kiwisolver>=1.0.1 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (1.3.2) Requirement already satisfied: pyparsing>=2.2.1 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (3.0.4) Requirement already satisfied: six>=1.5 in /home/ashish/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0) Installing collected packages: wordcloud Successfully installed wordcloud-1.8.2.2 Ref: % PyPI % conda-forge
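After installation, a minimal usage sketch; the sample text is an assumption.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "whatsapp chat analysis word cloud python pandas emoji sentiment topic"
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
wc.to_file("wordcloud.png")   # or save straight to an image file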

29. Social Analysis (SOAN)

Social Analysis based on Whatsapp data Ref: GitHub
Tags: Technology,Natural Language Processing,

Monday, October 3, 2022

The RulePolicy of Rasa (Oct 2022)

You would come across the RulePolicy configuration of Rasa if you ever land up on this Rasa documentation page: https://rasa.com/docs/rasa/policies/

The page for Forms says:

Usage

To use forms with Rasa Open Source you need to make sure that the Rule Policy is added to your policy configuration. For example:
policies:
  - name: RulePolicy

To see the lack of proper logging and reproduce the erroneous situation that this policy can land you in, do the following. Do a "$ rasa init" to initialize a project. Add the following code to the "config.yml":

policies:
  - name: RulePolicy

Next, do:

$ rasa train
$ rasa interactive

Next, we are going to show chat logs 'with the RulePolicy configuration' and 'without the RulePolicy configuration'.

Without the RulePolicy configuration (this is the default setting you get from '$ rasa init')

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop/ws/jupyter/rasa/project2$ rasa interactive 2022-10-04 00:09:02 INFO numexpr.utils - NumExpr defaulting to 4 threads. The configuration for policies and pipeline was chosen automatically. It was written into the config file at 'config.yml'. /home/ashish/.local/lib/python3.8/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. if LooseVersion(module.__version__) < minver: /home/ashish/anaconda3/envs/rasa_py38/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other) /home/ashish/.local/lib/python3.8/site-packages/tensorflow_addons/utils/ensure_tf_install.py:47: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. min_version = LooseVersion(INCLUSIVE_MIN_TF_VERSION) 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'CountVectorsFeaturizer' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'CountVectorsFeaturizer' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'DIETClassifier' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'EntitySynonymMapper' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'LexicalSyntacticFeaturizer' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'MemoizationPolicy' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'RegexFeaturizer' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'ResponseSelector' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'RulePolicy' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'TEDPolicy' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'UnexpecTEDIntentPolicy' from cache. Your Rasa model is trained and saved at 'models/20221004-000906-cream-hoops.tar.gz'. /home/ashish/.local/lib/python3.8/site-packages/colorclass/codes.py:5: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working from collections import Mapping /home/ashish/.local/lib/python3.8/site-packages/sanic_cors/extension.py:39: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. SANIC_VERSION = LooseVersion(sanic_version) 2022-10-04 00:09:24 INFO rasa.core.processor - Loading model models/20221004-000906-cream-hoops.tar.gz... 2022-10-04 00:09:51 WARNING rasa.shared.utils.common - The UnexpecTED Intent Policy is currently experimental and might change or be removed in the future 🔬 Please share your feedback on it in the forum (https://forum.rasa.com) to help us make this feature ready for production. 2022-10-04 00:10:04 INFO root - Rasa server is up and running. 
Processed story blocks: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1446.98it/s, # trackers=1] Processed rules: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2095.58it/s, # trackers=1] Bot loaded. Visualisation at http://localhost:5006/visualization.html . Type a message and press enter (press 'Ctrl-c' to exit). ? Your input -> hi ? Your NLU model classified 'hi' with intent 'greet' and there are no entities, is this correct? Yes ------ Chat History # Bot You ──────────────────────────────────────────── 1 action_listen ──────────────────────────────────────────── 2 hi intent: greet 1.00 Current slots: session_started_metadata: None ------ ? The bot wants to run 'utter_greet', correct? Yes /home/ashish/.local/lib/python3.8/site-packages/rasa/server.py:860: FutureWarning: The "POST /conversations/<conversation_id>/execute" endpoint is deprecated. Inserting actions to the tracker externally should be avoided. Actions should be predicted by the policies only. rasa.shared.utils.io.raise_warning( ------ Chat History # Bot You ──────────────────────────────────────────────── 1 action_listen ──────────────────────────────────────────────── 2 hi intent: greet 1.00 ──────────────────────────────────────────────── 3 utter_greet 1.00 Hey! How are you? ------ Current slots: session_started_metadata: None ? The bot wants to run 'action_listen', correct? Yes ? Your input -> great ? Your NLU model classified 'great' with intent 'mood_great' and there are no entities, is this correct? Yes ------ Chat History ------ # Bot You ────────────────────────────────────────────────────── 1 action_listen ────────────────────────────────────────────────────── 2 hi intent: greet 1.00 ────────────────────────────────────────────────────── 3 utter_greet 1.00 Hey! How are you? action_listen 1.00 ────────────────────────────────────────────────────── 4 great intent: mood_great 1.00 Current slots: session_started_metadata: None ? The bot wants to run 'utter_happy', correct? Yes ------ Chat History # Bot You ────────────────────────────────────────────────────── 1 action_listen ────────────────────────────────────────────────────── 2 hi intent: greet 1.00 ────────────────────────────────────────────────────── 3 utter_greet 1.00 Hey! How are you? action_listen 1.00 ────────────────────────────────────────────────────── 4 great intent: mood_great 1.00 ────────────────────────────────────────────────────── 5 utter_happy 1.00 Great, carry on! ------ Current slots: session_started_metadata: None ? The bot wants to run 'action_listen', correct? Yes ? Your input ->

Logs with the RulePolicy configuration

Erroneous Logs (Part 1)

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop/ws/jupyter/rasa/project2$ rasa interactive 2022-10-03 23:35:00 INFO numexpr.utils - NumExpr defaulting to 4 threads. The configuration for pipeline was chosen automatically. It was written into the config file at 'config.yml'. /home/ashish/.local/lib/python3.8/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. if LooseVersion(module.__version__) < minver: /home/ashish/anaconda3/envs/rasa_py38/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other) /home/ashish/.local/lib/python3.8/site-packages/tensorflow_addons/utils/ensure_tf_install.py:47: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. min_version = LooseVersion(INCLUSIVE_MIN_TF_VERSION) 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'CountVectorsFeaturizer' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'CountVectorsFeaturizer' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'DIETClassifier' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'EntitySynonymMapper' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'LexicalSyntacticFeaturizer' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'RegexFeaturizer' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'ResponseSelector' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'RulePolicy' from cache. Your Rasa model is trained and saved at 'models/20221003-233504-energetic-palace.tar.gz'. /home/ashish/.local/lib/python3.8/site-packages/colorclass/codes.py:5: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working from collections import Mapping /home/ashish/.local/lib/python3.8/site-packages/sanic_cors/extension.py:39: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. SANIC_VERSION = LooseVersion(sanic_version) 2022-10-03 23:35:20 INFO rasa.core.processor - Loading model models/20221003-233504-energetic-palace.tar.gz... 2022-10-03 23:35:33 INFO root - Rasa server is up and running. Processed story blocks: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1607.63it/s, # trackers=1] Processed rules: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2113.53it/s, # trackers=1] Bot loaded. Visualisation at http://localhost:5006/visualization.html . Type a message and press enter (press 'Ctrl-c' to exit). ? Your input -> hi ? Your NLU model classified 'hi' with intent 'greet' and there are no entities, is this correct? Yes ------ Chat History # Bot You ──────────────────────────────────────────── 1 action_listen ──────────────────────────────────────────── 2 hi intent: greet 1.00 ------ Current slots: session_started_metadata: None ? The bot wants to run 'action_default_fallback', correct? (Y/n) Cancelled by user ? Do you want to stop? 
(Use arrow keys) Cancelled by user ? Export stories to (if file exists, this will append the stories) data/stories.yml Cancelled by user 2022-10-03 23:36:09 INFO rasa.core.training.interactive - Killing Sanic server now.

Erroneous Logs (Part 2)

? Your input -> hi ? Your NLU model classified 'hi' with intent 'greet' and there are no entities, is this correct? Yes ------ Chat History # Bot You ──────────────────────────────────────────── 1 action_listen ──────────────────────────────────────────── 2 hi intent: greet 1.00 Current slots: session_started_metadata: None ------ ? The bot wants to run 'action_default_fallback', correct? Yes /home/ashish/.local/lib/python3.8/site-packages/rasa/server.py:860: FutureWarning: The "POST /conversations/<conversation_id>/execute" endpoint is deprecated. Inserting actions to the tracker externally should be avoided. Actions should be predicted by the policies only. rasa.shared.utils.io.raise_warning( ------ Chat History # Bot You ──────────────────────────────────── 1 action_listen ------ Current slots: session_started_metadata: None ? The bot wants to run 'action_default_fallback', correct? Yes ------ Chat History ------ # Bot You ──────────────────────────────────── 1 action_listen Current slots: session_started_metadata: None ? The bot wants to run 'action_default_fallback', correct? Yes ------

What is this "action_default_fallback"?

The answer comes from Rasa Documentation:

Policy Priority

In the case that two policies predict with equal confidence (for example, the Memoization and Rule Policies might both predict with confidence 1), the priority of the policies is considered. Rasa Open Source policies have default priorities that are set to ensure the expected outcome in the case of a tie. They look like this, where higher numbers have higher priority:
6 - RulePolicy
3 - MemoizationPolicy or AugmentedMemoizationPolicy
2 - UnexpecTEDIntentPolicy
1 - TEDPolicy

Rule Policy

The RulePolicy is a policy that handles conversation parts that follow a fixed behavior (e.g. business logic). It makes predictions based on any rules you have in your training data. See the Rules documentation for further information on how to define rules. The RulePolicy has the following configuration options:

File: config.yml

policies:
  - name: "RulePolicy"
    core_fallback_threshold: 0.3
    core_fallback_action_name: action_default_fallback
    enable_fallback_prediction: true
    restrict_rules: true
    check_for_contradictions: true

The values shown above are the default settings. Ref: https://rasa.com/docs/rasa/policies
Tags: Technology,Natural Language Processing,Rasa,

6 Labeled Datasets For Sentiment Analysis

Download Code and Data

import pandas as pd
import seaborn as sns


1. Amazon Reviews

amazon_reviews = pd.read_csv('input/amazonReviewSnippets_GroundTruth.txt', sep = '\t')
amazon_reviews['dataset'] = 'amazon'

def get_sentiment_label(sentiment_score):
    if (sentiment_score < 0):
        return 'Negative'
    else:
        return 'Positive'

amazon_reviews['sentiment_label'] = amazon_reviews['sentiment'].apply(get_sentiment_label)
amazon_reviews['length'] = amazon_reviews['text'].apply(len)

def get_word_count(text):
    text = text.split()
    return len(text)

amazon_reviews['word_count'] = amazon_reviews['text'].apply(get_word_count)
amazon_reviews.head()
sns.countplot(x ='sentiment_label', data = amazon_reviews)
amazon_reviews['word_count'].describe() count 3546.000000 mean 17.300056 std 31.449383 min 1.000000 25% 9.000000 50% 15.000000 75% 21.000000 max 1220.000000 Name: word_count, dtype: float64

If the maximum number of tokens in a text exceeds 512, a plain BERT embedding cannot be used and we have to use SentenceBERT as the embedding technique.

2. Movie Reviews

movie_reviews = pd.read_csv('input/movieReviewSnippets_GroundTruth.txt', sep = '\t')
movie_reviews['dataset'] = 'movie reviews'
movie_reviews['sentiment_label'] = movie_reviews['sentiment'].apply(get_sentiment_label)
movie_reviews['word_count'] = movie_reviews['text'].apply(get_word_count)
movie_reviews.head(5)
sns.countplot(x ='sentiment_label', data = movie_reviews)
movie_reviews['word_count'].describe() count 10605.000000 mean 18.864875 std 8.702398 min 1.000000 25% 12.000000 50% 18.000000 75% 25.000000 max 51.000000 Name: word_count, dtype: float64

3. New York Editorial Snippets

nyt_editorial_snippets = pd.read_csv('input/nytEditorialSnippets_GroundTruth.txt', sep = '\t')
nyt_editorial_snippets['dataset'] = 'nyt_editorial_snippets'
nyt_editorial_snippets['sentiment_label'] = nyt_editorial_snippets['sentiment'].apply(get_sentiment_label)
nyt_editorial_snippets['word_count'] = nyt_editorial_snippets['text'].apply(get_word_count)
nyt_editorial_snippets.head()
sns.countplot(x ='sentiment_label', data = nyt_editorial_snippets)
nyt_editorial_snippets['word_count'].describe() count 5183.000000 mean 17.482925 std 8.767046 min 1.000000 25% 11.000000 50% 17.000000 75% 23.000000 max 91.000000 Name: word_count, dtype: float64

4. General Twitter Data (Tweets)

tweets_groud_truth = pd.read_csv('input/tweets_GroundTruth.txt', sep = '\t')
tweets_groud_truth['dataset'] = 'tweets_groud_truth'
tweets_groud_truth['sentiment_label'] = tweets_groud_truth['sentiment'].apply(get_sentiment_label)
tweets_groud_truth['word_count'] = tweets_groud_truth['text'].apply(get_word_count)
tweets_groud_truth.head()
sns.countplot(x ='sentiment_label', data = tweets_groud_truth)
tweets_groud_truth['word_count'].describe() count 4200.000000 mean 13.619286 std 6.720463 min 1.000000 25% 8.000000 50% 13.000000 75% 19.000000 max 32.000000 Name: word_count, dtype: float64

5. US Presidential Election of 2016

us_presidential_election_2016 = pd.read_csv('input/us_politics_presidential_election_2016.csv', sep = ',')
us_presidential_election_2016 = us_presidential_election_2016[['id', 'sentiment', 'text']]
us_presidential_election_2016['dataset'] = 'us_presidential_election_2016'
us_presidential_election_2016.head()
sns.countplot(x ='sentiment', data = us_presidential_election_2016)
us_presidential_election_2016['word_count'] = us_presidential_election_2016['text'].apply(get_word_count)
us_presidential_election_2016['word_count'].describe()
count 13871.000000 mean 16.943912 std 5.224908 min 2.000000 25% 13.000000 50% 18.000000 75% 21.000000 max 29.000000 Name: word_count, dtype: float64

6. Stock Market Related Tweets

stock_market_tweets = pd.read_csv('input/stock_market_twitter_data.csv')
stock_market_tweets['sentiment_label'] = stock_market_tweets['Sentiment'].apply(get_sentiment_label)
stock_market_tweets['word_count'] = stock_market_tweets['Text'].apply(get_word_count)
stock_market_tweets.head()
sns.countplot(x ='sentiment_label', data = stock_market_tweets)
stock_market_tweets['word_count'].describe() count 5791.000000 mean 14.006562 std 6.595463 min 2.000000 25% 9.000000 50% 14.000000 75% 19.000000 max 32.000000 Name: word_count, dtype: float64
Tags: Technology,Natural Language Processing,