Friday, October 7, 2022

Spark Installation on Windows (2022-Oct-07, Status Failure, Part 1)

$ conda install pyspark -c conda-forge

---------------------------

Test

In Jupyter Lab

import pyspark

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 import pyspark

File ~\Anaconda3\lib\site-packages\pyspark\__init__.py:51, in <module>
     48 import types
     50 from pyspark.conf import SparkConf
---> 51 from pyspark.context import SparkContext
     52 from pyspark.rdd import RDD, RDDBarrier
     53 from pyspark.files import SparkFiles

File ~\Anaconda3\lib\site-packages\pyspark\context.py:31, in <module>
     27 from tempfile import NamedTemporaryFile
     29 from py4j.protocol import Py4JError
---> 31 from pyspark import accumulators
     32 from pyspark.accumulators import Accumulator
     33 from pyspark.broadcast import Broadcast, BroadcastPickleRegistry

File ~\Anaconda3\lib\site-packages\pyspark\accumulators.py:97, in <module>
     95 import socketserver as SocketServer
     96 import threading
---> 97 from pyspark.serializers import read_int, PickleSerializer
    100 __all__ = ['Accumulator', 'AccumulatorParam']
    103 pickleSer = PickleSerializer()

File ~\Anaconda3\lib\site-packages\pyspark\serializers.py:71, in <module>
     68 protocol = 3
     69 xrange = range
---> 71 from pyspark import cloudpickle
     72 from pyspark.util import _exception_message
     75 __all__ = ["PickleSerializer", "MarshalSerializer", "UTF8Deserializer"]

File ~\Anaconda3\lib\site-packages\pyspark\cloudpickle.py:145, in <module>
    125 else:
    126     return types.CodeType(
    127         co.co_argcount,
    128         co.co_kwonlyargcount,
   (...)
    141         (),
    142     )
--> 145 _cell_set_template_code = _make_cell_set_template_code()
    148 def cell_set(cell, value):
    149     """Set the value of a closure cell.
    150     """

File ~\Anaconda3\lib\site-packages\pyspark\cloudpickle.py:126, in _make_cell_set_template_code()
    109     return types.CodeType(
    110         co.co_argcount,
    111         co.co_nlocals,
   (...)
    123         (),
    124     )
    125 else:
--> 126     return types.CodeType(
    127         co.co_argcount,
    128         co.co_kwonlyargcount,
    129         co.co_nlocals,
    130         co.co_stacksize,
    131         co.co_flags,
    132         co.co_code,
    133         co.co_consts,
    134         co.co_names,
    135         co.co_varnames,
    136         co.co_filename,
    137         co.co_name,
    138         co.co_firstlineno,
    139         co.co_lnotab,
    140         co.co_cellvars,  # this is the trickery
    141         (),
    142     )

TypeError: an integer is required (got type bytes)
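This traceback bottoms out in the copy of cloudpickle bundled inside pyspark, which still calls types.CodeType with the pre-Python-3.8 signature. As the uninstall log below shows, the conda-forge solve had actually installed pyspark 2.4.4, and Spark 2.4.x does not support Python 3.8+, which is exactly what produces "TypeError: an integer is required (got type bytes)" on this Python 3.9 base environment. A small sketch to catch the mismatch up front, without importing pyspark:

# Sketch: flag the known Spark 2.x vs Python 3.8+ mismatch without importing pyspark.
import sys
from importlib.metadata import version

spark_version = version("pyspark")
if int(spark_version.split(".")[0]) < 3 and sys.version_info >= (3, 8):
    print(f"pyspark {spark_version} does not support Python "
          f"{sys.version_info.major}.{sys.version_info.minor}; install pyspark >= 3.x")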

In Python CLI

(base) C:\Users\ashish>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\context.py", line 31, in <module>
    from pyspark import accumulators
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\serializers.py", line 71, in <module>
    from pyspark import cloudpickle
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)
>>> exit()

---------------------------

(base) C:\Users\ashish>conda uninstall pyspark
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: C:\Users\ashish\Anaconda3

  removed specs:
    - pyspark

The following packages will be REMOVED:

  py4j-0.10.7-py_1
  pyspark-2.4.4-py_0
  python_abi-3.9-2_cp39

The following packages will be SUPERSEDED by a higher-priority channel:

  conda  conda-forge::conda-22.9.0-py39hcbf530~ --> pkgs/main::conda-22.9.0-py39haa95532_0

None
Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

---------------------------

Re-installing via pip3

(base) C:\Users\ashish>pip3 install pyspark
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
     |████████████████████████████████| 281.3 MB 70 kB/s
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     |████████████████████████████████| 199 kB 285 kB/s
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764040 sha256=0684a408679e5a3890611f7025e07d22688f5142c0fe6b90ca535e805b9ae007
  Stored in directory: c:\users\ashish\appdata\local\pip\cache\wheels\05\75\73\81f84d174299abca38dd6a06a5b98b08ae25fce50ab8986fa1
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0

Testing via a Top-Level Import and Printing the Version

(base) C:\Users\ashish>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> pyspark.__version__
'3.3.0'

Further Testing Through Code

In Jupyter Lab

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext  # Main entry point for DataFrame and SQL functionality.

df = pd.DataFrame({
    "col1": ["val1"],
    "col2": ["val2"]
})

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(df)

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Input In [35], in <cell line: 1>()
----> 1 sdf.show()

File ~\Anaconda3\lib\site-packages\pyspark\sql\dataframe.py:606, in DataFrame.show(self, n, truncate, vertical)
    603     raise TypeError("Parameter 'vertical' must be a bool")
    605 if isinstance(truncate, bool) and truncate:
--> 606     print(self._jdf.showString(n, 20, vertical))
    607 else:
    608     try:

File ~\Anaconda3\lib\site-packages\py4j\java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File ~\Anaconda3\lib\site-packages\pyspark\sql\utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
    188 def deco(*a: Any, **kw: Any) -> Any:
    189     try:
--> 190         return f(*a, **kw)
    191     except Py4JJavaError as e:
    192         converted = convert_exception(e.java_exception)

File ~\Anaconda3\lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o477.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 
29 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863) at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856) at org.apache.spark.sql.Dataset.head(Dataset.scala:2863) at org.apache.spark.sql.Dataset.take(Dataset.scala:3084) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288) at org.apache.spark.sql.Dataset.showString(Dataset.scala:327) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Unknown Source) Caused by: org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ... 1 more Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more
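The root failure in this trace is "org.apache.spark.SparkException: Python worker failed to connect back", which on Windows usually means the JVM could not launch a Python worker process that it can talk to. A commonly suggested workaround is to point Spark explicitly at the interpreter running the notebook by setting PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON before any SparkContext exists (i.e. after a kernel restart). A minimal sketch, assuming the notebook's own interpreter (sys.executable) is the one Spark should use; it also switches to SparkSession, which is what the FutureWarning in the CLI run below recommends over SQLContext:

import os
import sys

import pandas as pd
from pyspark.sql import SparkSession

# Point both the driver and the workers at the interpreter running this notebook.
# These variables must be set before the SparkContext/SparkSession is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = SparkSession.builder.master("local[*]").appName("pyspark-smoke-test").getOrCreate()

df = pd.DataFrame({"col1": ["val1"], "col2": ["val2"]})
sdf = spark.createDataFrame(df)
sdf.show()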

In Python CLI

(base) C:\Users\ashish\Desktop\20221004\code>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 15:42:09 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 15:42:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\sql\context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  warnings.warn(
>>> sdf = sqlCtx.createDataFrame(df)
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 15:43:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by:
java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more 22/10/07 15:43:14 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 
29 more 22/10/07 15:43:14 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job Traceback (most recent call last): (0 + 0) / 1] File "<stdin>", line 1, in <module> File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\sql\dataframe.py", line 606, in show print(self._jdf.showString(n, 20, vertical)) File "C:\Users\ashish\Anaconda3\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__ return_value = get_return_value( File "C:\Users\ashish\Anaconda3\lib\site-packages\pyspark\sql\utils.py", line 190, in deco return f(*a, **kw) File "C:\Users\ashish\Anaconda3\lib\site-packages\py4j\protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at 
java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863) at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856) at org.apache.spark.sql.Dataset.head(Dataset.scala:2863) at org.apache.spark.sql.Dataset.take(Dataset.scala:3084) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288) at org.apache.spark.sql.Dataset.showString(Dataset.scala:327) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Unknown Source) Caused by: org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ... 1 more Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more >>> ---------------------------
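Two hints buried in this CLI run are worth isolating. First, the warning "WARN Shell: Did not find winutils.exe ... HADOOP_HOME and hadoop.home.dir are unset" means the Hadoop helper binaries for Windows are missing. Second, the line "Python was not found; run without arguments to install from the Microsoft Store" is printed by the Windows App Execution Alias stub, which suggests Spark is starting its worker with a bare python/python3 command that resolves to the Store stub rather than the Anaconda interpreter. A small diagnostic sketch for both checks (the C:\hadoop path is only an assumed example location for a winutils.exe download, not something this setup already has):

# Sketch: check for the two Windows-specific problems hinted at in the log above.
import os
import shutil

# 1. Does a bare "python"/"python3" resolve to a real interpreter or to the Store alias stub?
for name in ("python", "python3"):
    path = shutil.which(name)
    print(f"{name!r} resolves to: {path}")
    if path and "WindowsApps" in path:
        print("  -> Microsoft Store alias stub, not a real interpreter")

# 2. Is HADOOP_HOME set, and does it contain bin\winutils.exe?
hadoop_home = os.environ.get("HADOOP_HOME", r"C:\hadoop")  # assumed example path
print("HADOOP_HOME:", os.environ.get("HADOOP_HOME"))
print("winutils.exe present:", os.path.exists(os.path.join(hadoop_home, "bin", "winutils.exe")))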

Fresh Installation Using an env.yml File

(base) C:\Users\ashish\Desktop>conda env create -f env.yml Collecting package metadata (repodata.json): done Solving environment: done Downloading and Extracting Packages smart_open-6.2.0 | 44 KB | ### | 100% executing-1.1.0 | 23 KB | ### | 100% libthrift-0.15.0 | 865 KB | ### | 100% jsonschema-4.16.0 | 65 KB | ### | 100% aws-c-http-0.6.6 | 154 KB | ### | 100% pandas-1.5.0 | 11.7 MB | ### | 100% nbclient-0.7.0 | 65 KB | ### | 100% liblapack-3.9.0 | 5.6 MB | ### | 100% setuptools-65.4.1 | 776 KB | ### | 100% cffi-1.15.1 | 225 KB | ### | 100% aws-c-event-stream-0 | 47 KB | ### | 100% jupyter_console-6.4. | 23 KB | ### | 100% aws-c-s3-0.1.27 | 49 KB | ### | 100% matplotlib-3.6.0 | 8 KB | ### | 100% libpng-1.6.38 | 773 KB | ### | 100% libdeflate-1.14 | 73 KB | ### | 100% gst-plugins-base-1.2 | 2.4 MB | ### | 100% parquet-cpp-1.5.1 | 3 KB | ### | 100% scikit-learn-1.1.2 | 7.5 MB | ### | 100% libutf8proc-2.7.0 | 101 KB | ### | 100% pkgutil-resolve-name | 9 KB | ### | 100% pyspark-3.3.0 | 268.1 MB | ### | 100% kiwisolver-1.4.4 | 61 KB | ### | 100% importlib_resources- | 28 KB | ### | 100% sip-6.6.2 | 523 KB | ### | 100% gensim-4.2.0 | 22.5 MB | ### | 100% ipykernel-6.16.0 | 100 KB | ### | 100% certifi-2022.9.24 | 155 KB | ### | 100% nbformat-5.6.1 | 106 KB | ### | 100% libblas-3.9.0 | 5.6 MB | ### | 100% gflags-2.2.2 | 80 KB | ### | 100% aws-c-mqtt-0.7.8 | 66 KB | ### | 100% jupyter_client-7.3.5 | 91 KB | ### | 100% ipython-8.5.0 | 553 KB | ### | 100% pywin32-303 | 6.8 MB | ### | 100% qtconsole-5.3.2 | 6 KB | ### | 100% notebook-6.4.12 | 6.3 MB | ### | 100% matplotlib-base-3.6. | 7.5 MB | ### | 100% pyarrow-6.0.0 | 2.4 MB | ### | 100% soupsieve-2.3.2.post | 34 KB | ### | 100% glog-0.5.0 | 90 KB | ### | 100% asttokens-2.0.8 | 24 KB | ### | 100% stack_data-0.5.1 | 24 KB | ### | 100% tbb-2021.6.0 | 174 KB | ### | 100% libiconv-1.17 | 698 KB | ### | 100% re2-2021.11.01 | 476 KB | ### | 100% pillow-9.2.0 | 45.4 MB | ### | 100% scipy-1.9.1 | 28.2 MB | ### | 100% grpc-cpp-1.41.1 | 17.6 MB | ### | 100% joblib-1.2.0 | 205 KB | ### | 100% qt-main-5.15.6 | 68.8 MB | ### | 100% aws-crt-cpp-0.17.1 | 191 KB | ### | 100% tqdm-4.64.1 | 82 KB | ### | 100% regex-2022.9.13 | 350 KB | ### | 100% aws-sdk-cpp-1.9.120 | 5.5 MB | ### | 100% python-fastjsonschem | 242 KB | ### | 100% nbconvert-pandoc-7.2 | 5 KB | ### | 100% vs2015_runtime-14.29 | 1.2 MB | ### | 100% libglib-2.74.0 | 3.1 MB | ### | 100% gettext-0.19.8.1 | 4.7 MB | ### | 100% numpy-1.23.3 | 6.3 MB | ### | 100% nbconvert-7.2.1 | 6 KB | ### | 100% jupyter_core-4.11.1 | 106 KB | ### | 100% pywinpty-2.0.8 | 229 KB | ### | 100% aws-c-common-0.6.11 | 165 KB | ### | 100% python-3.10.6 | 16.5 MB | ### | 100% aws-c-auth-0.6.4 | 91 KB | ### | 100% nest-asyncio-1.5.6 | 10 KB | ### | 100% matplotlib-inline-0. 
| 12 KB | ### | 100% tzdata-2022d | 118 KB | ### | 100% libtiff-4.4.0 | 1.1 MB | ### | 100% libprotobuf-3.18.1 | 2.3 MB | ### | 100% aws-c-io-0.10.9 | 127 KB | ### | 100% libssh2-1.10.0 | 228 KB | ### | 100% debugpy-1.6.3 | 3.2 MB | ### | 100% unicodedata2-14.0.0 | 491 KB | ### | 100% contourpy-1.0.5 | 176 KB | ### | 100% terminado-0.16.0 | 19 KB | ### | 100% pcre2-10.37 | 942 KB | ### | 100% pandoc-2.19.2 | 18.9 MB | ### | 100% prompt-toolkit-3.0.3 | 254 KB | ### | 100% glib-2.74.0 | 452 KB | ### | 100% importlib-metadata-4 | 34 KB | ### | 100% pyrsistent-0.18.1 | 86 KB | ### | 100% mistune-2.0.4 | 67 KB | ### | 100% libcblas-3.9.0 | 5.6 MB | ### | 100% xz-5.2.6 | 213 KB | ### | 100% argon2-cffi-bindings | 34 KB | ### | 100% jupyter-1.0.0 | 7 KB | ### | 100% aws-c-cal-0.5.12 | 36 KB | ### | 100% zlib-1.2.12 | 114 KB | ### | 100% libzlib-1.2.12 | 71 KB | ### | 100% ipywidgets-8.0.2 | 109 KB | ### | 100% traitlets-5.4.0 | 85 KB | ### | 100% jupyterlab_widgets-3 | 222 KB | ### | 100% libsqlite-3.39.4 | 642 KB | ### | 100% jinja2-3.1.2 | 99 KB | ### | 100% zstd-1.5.2 | 401 KB | ### | 100% ca-certificates-2022 | 189 KB | ### | 100% nbconvert-core-7.2.1 | 189 KB | ### | 100% pyzmq-24.0.1 | 461 KB | ### | 100% psutil-5.9.2 | 370 KB | ### | 100% click-8.1.3 | 149 KB | ### | 100% pip-22.2.2 | 1.5 MB | ### | 100% libcurl-7.85.0 | 311 KB | ### | 100% vc-14.2 | 14 KB | ### | 100% markupsafe-2.1.1 | 25 KB | ### | 100% pyqt-5.15.7 | 4.7 MB | ### | 100% arrow-cpp-6.0.0 | 15.7 MB | ### | 100% prompt_toolkit-3.0.3 | 5 KB | ### | 100% pygments-2.13.0 | 821 KB | ### | 100% bleach-5.0.1 | 124 KB | ### | 100% jedi-0.18.1 | 799 KB | ### | 100% py4j-0.10.9.5 | 181 KB | ### | 100% aws-c-compression-0. | 20 KB | ### | 100% gstreamer-1.20.3 | 2.2 MB | ### | 100% tornado-6.2 | 666 KB | ### | 100% statsmodels-0.13.2 | 10.4 MB | ### | 100% fonttools-4.37.4 | 1.7 MB | ### | 100% pyqt5-sip-12.11.0 | 82 KB | ### | 100% widgetsnbextension-4 | 1.6 MB | ### | 100% attrs-22.1.0 | 48 KB | ### | 100% pytz-2022.4 | 232 KB | ### | 100% qtconsole-base-5.3.2 | 91 KB | ### | 100% qtpy-2.2.1 | 49 KB | ### | 100% glib-tools-2.74.0 | 168 KB | ### | 100% aws-checksums-0.1.12 | 51 KB | ### | 100% Preparing transaction: done Verifying transaction: done Executing transaction: done Installing pip dependencies: | Ran pip subprocess with arguments: ['C:\\Users\\ashish\\Anaconda3\\envs\\mh\\python.exe', '-m', 'pip', 'install', '-U', '-r', 'C:\\Users\\ashish\\Desktop\\condaenv.ug1b_vpf.requirements.txt'] Pip subprocess output: Collecting rpy2==3.4.5 Using cached rpy2-3.4.5.tar.gz (194 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'done' Requirement already satisfied: cffi>=1.10.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (1.15.1) Requirement already satisfied: jinja2 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (3.1.2) Requirement already satisfied: pytz in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (2022.4) Collecting tzlocal Using cached tzlocal-4.2-py3-none-any.whl (19 kB) Requirement already satisfied: pycparser in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from cffi>=1.10.0->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (2.21) Requirement 
already satisfied: MarkupSafe>=2.0 in c:\users\ashish\anaconda3\envs\mh\lib\site-packages (from jinja2->rpy2==3.4.5->-r C:\Users\ashish\Desktop\condaenv.ug1b_vpf.requirements.txt (line 1)) (2.1.1) Collecting tzdata Using cached tzdata-2022.4-py2.py3-none-any.whl (336 kB) Collecting pytz-deprecation-shim Using cached pytz_deprecation_shim-0.1.0.post0-py2.py3-none-any.whl (15 kB) Building wheels for collected packages: rpy2 Building wheel for rpy2 (setup.py): started Building wheel for rpy2 (setup.py): finished with status 'done' Created wheel for rpy2: filename=rpy2-3.4.5-py3-none-any.whl size=198842 sha256=9b472eb2c0a65535eac19151a43e5d6fc3dbe5930a3953d76fcb2b170c8106ee Stored in directory: c:\users\ashish\appdata\local\pip\cache\wheels\ba\d8\8b\68fc240578a71188d0ca04b6fe8a58053fbcbcfbe2a3cbad12 Successfully built rpy2 Installing collected packages: tzdata, pytz-deprecation-shim, tzlocal, rpy2 Successfully installed pytz-deprecation-shim-0.1.0.post0 rpy2-3.4.5 tzdata-2022.4 tzlocal-4.2 done # # To activate this environment, use # # $ conda activate mh # # To deactivate an active environment, use # # $ conda deactivate Retrieving notices: ...working... done (base) C:\Users\ashish\Desktop> (base) C:\Users\ashish\Desktop> (base) C:\Users\ashish\Desktop>conda activate mh (mh) C:\Users\ashish\Desktop>python -m ipykernel install --user --name mh Installed kernelspec mh in C:\Users\ashish\AppData\Roaming\jupyter\kernels\mh
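After registering the mh kernel, a quick sanity check run with the new environment's interpreter (or on the registered kernel) confirms that it really is the env's Python and the pyspark build installed there that will be used; a minimal sketch:

# Sketch: confirm the interpreter and pyspark build inside the new "mh" environment.
import sys
import pyspark

print("Interpreter:", sys.executable)       # should point inside ...\Anaconda3\envs\mh
print("PySpark    :", pyspark.__version__)  # the env above resolved pyspark 3.3.0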

Same Error in the New Environment

(base) C:\Users\ashish\Desktop\20221004\code>conda activate mh

(mh) C:\Users\ashish\Desktop\20221004\code>python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:30:19) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> df = pd.DataFrame({ "col1": ["val1"], "col2": ["val2"] })
>>> sc = SparkContext.getOrCreate()
22/10/07 16:30:36 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/07 16:30:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> sqlCtx = SQLContext(sc)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  warnings.warn(
>>> sdf = sqlCtx.createDataFrame(df)
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\pandas\conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():
C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\pandas\conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for column, series in pdf.iteritems():
>>>
>>> sdf.show()
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
22/10/07 16:32:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more 22/10/07 16:32:04 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back. 
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more 22/10/07 16:32:04 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\dataframe.py", line 606, in show print(self._jdf.showString(n, 20, vertical)) File "C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__ return_value = get_return_value( File "C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\pyspark\sql\utils.py", line 190, in deco return f(*a, **kw) File "C:\Users\ashish\Anaconda3\envs\mh\lib\site-packages\py4j\protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOPHOST.ad.company.com executor driver): org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 
29 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863) at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856) at org.apache.spark.sql.Dataset.head(Dataset.scala:2863) at org.apache.spark.sql.Dataset.take(Dataset.scala:3084) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288) at org.apache.spark.sql.Dataset.showString(Dataset.scala:327) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Unknown Source) Caused by: org.apache.spark.SparkException: Python worker failed to connect back. at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ... 1 more Caused by: java.net.SocketTimeoutException: Accept timed out at java.base/sun.nio.ch.NioSocketImpl.timedAccept(Unknown Source) at java.base/sun.nio.ch.NioSocketImpl.accept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.platformImplAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.implAccept(Unknown Source) at java.base/java.net.ServerSocket.accept(Unknown Source) at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176) ... 29 more >>>
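Since a freshly created environment reproduces the same "Python worker failed to connect back" failure, the remaining suspects are the interpreter handed to the worker processes and the missing winutils.exe, not the package set itself. Besides the PYSPARK_PYTHON environment variables shown earlier, the worker interpreter can also be pinned per session through Spark configuration; a sketch, assuming the default Anaconda layout for the mh environment's python.exe:

from pyspark.sql import SparkSession

# Assumed path: default Anaconda envs layout for the "mh" environment created above.
MH_PYTHON = r"C:\Users\ashish\Anaconda3\envs\mh\python.exe"

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("worker-interpreter-test")
    # Tells executors which Python to launch workers with (the driver is already this interpreter).
    .config("spark.pyspark.python", MH_PYTHON)
    .getOrCreate()
)

spark.createDataFrame([("val1", "val2")], ["col1", "col2"]).show()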
Tags: Technology,Spark,

Wednesday, October 5, 2022

Social Analysis (SOAN) based on Whatsapp data (Project Setup)

This project is being developed by: Maarten Grootendorst
Project Link: SOAN: GitHub    

In this post, we share a minimal YAML file to set up an environment for running the SOAN project.


name: soan
channels:
  - conda-forge
dependencies:
  - pandas
  - seaborn
  - scikit-learn
  - matplotlib
  - emoji
  - pattern
  - wordcloud
  - palettable
  - ipykernel
  - jupyter
    


Run Logs

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop$ conda env create -f env.yml Collecting package metadata (repodata.json): done Solving environment: done Downloading and Extracting Packages contourpy-1.0.5 | 234 KB | ### | 100% idna-3.4 | 55 KB | ### | 100% libiconv-1.17 | 1.4 MB | ### | 100% stack_data-0.5.1 | 24 KB | ### | 100% pyopenssl-22.0.0 | 120 KB | ### | 100% scipy-1.9.1 | 26.2 MB | ### | 100% jack-1.9.18 | 647 KB | ### | 100% mysqlclient-2.0.3 | 82 KB | ### | 100% mysql-libs-8.0.30 | 1.9 MB | ### | 100% fontconfig-2.14.0 | 318 KB | ### | 100% glib-tools-2.74.0 | 108 KB | ### | 100% fonttools-4.37.4 | 2.0 MB | ### | 100% qtpy-2.2.1 | 49 KB | ### | 100% python-fastjsonschem | 242 KB | ### | 100% libcap-2.65 | 97 KB | ### | 100% ca-certificates-2022 | 150 KB | ### | 100% gstreamer-1.20.3 | 2.0 MB | ### | 100% jaraco.context-4.1.2 | 9 KB | ### | 100% repoze.lru-0.7 | 13 KB | ### | 100% matplotlib-3.6.0 | 7 KB | ### | 100% jaraco.classes-3.2.2 | 10 KB | ### | 100% glib-2.74.0 | 438 KB | ### | 100% qtconsole-5.3.2 | 6 KB | ### | 100% qtconsole-base-5.3.2 | 91 KB | ### | 100% executing-1.1.0 | 23 KB | ### | 100% pyzmq-24.0.1 | 510 KB | ### | 100% nest-asyncio-1.5.6 | 10 KB | ### | 100% qt-main-5.15.6 | 61.5 MB | ### | 100% fftw-3.3.10 | 2.2 MB | ### | 100% sqlite-3.39.4 | 789 KB | ### | 100% libzlib-1.2.12 | 65 KB | ### | 100% routes-2.5.1 | 35 KB | ### | 100% portaudio-19.6.0 | 132 KB | ### | 100% regex-2022.9.13 | 386 KB | ### | 100% zc.lockfile-2.0 | 12 KB | ### | 100% nbconvert-pandoc-7.1 | 5 KB | ### | 100% cherrypy-18.8.0 | 504 KB | ### | 100% widgetsnbextension-4 | 1.6 MB | ### | 100% jaraco.collections-3 | 14 KB | ### | 100% pdfminer.six-2022052 | 4.9 MB | ### | 100% jupyterlab_widgets-3 | 222 KB | ### | 100% pcre2-10.37 | 1.1 MB | ### | 100% setuptools-65.4.1 | 776 KB | ### | 100% pytz-2022.4 | 232 KB | ### | 100% certifi-2022.9.24 | 155 KB | ### | 100% nbconvert-core-7.1.0 | 189 KB | ### | 100% expat-2.4.9 | 189 KB | ### | 100% libdeflate-1.14 | 81 KB | ### | 100% tzdata-2022d | 118 KB | ### | 100% numpy-1.23.3 | 7.0 MB | ### | 100% ipywidgets-8.0.2 | 109 KB | ### | 100% future-0.18.2 | 738 KB | ### | 100% terminado-0.16.0 | 18 KB | ### | 100% gettext-0.19.8.1 | 3.6 MB | ### | 100% simplejson-3.17.6 | 104 KB | ### | 100% nbformat-5.6.1 | 106 KB | ### | 100% nbconvert-7.1.0 | 6 KB | ### | 100% mysql-common-8.0.30 | 1.9 MB | ### | 100% pulseaudio-14.0 | 1.7 MB | ### | 100% libsqlite-3.39.4 | 803 KB | ### | 100% cryptography-38.0.1 | 1.6 MB | ### | 100% libglib-2.74.0 | 3.1 MB | ### | 100% alsa-lib-1.2.7.2 | 581 KB | ### | 100% jaraco.text-3.9.1 | 21 KB | ### | 100% ipykernel-6.16.0 | 100 KB | ### | 100% gst-plugins-base-1.2 | 2.8 MB | ### | 100% prompt_toolkit-3.0.3 | 5 KB | ### | 100% libtiff-4.4.0 | 651 KB | ### | 100% libxml2-2.10.2 | 727 KB | ### | 100% joblib-1.2.0 | 205 KB | ### | 100% matplotlib-base-3.6. | 7.5 MB | ### | 100% wordcloud-1.8.2.2 | 184 KB | ### | 100% Preparing transaction: done Verifying transaction: done Executing transaction: done # # To activate this environment, use # # $ conda activate soan # # To deactivate an active environment, use # # $ conda deactivate Retrieving notices: ...working... done

Set up the Jupyter Lab Kernel

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop$ conda activate soan (soan) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop$ python -m ipykernel install --user --name soan Installed kernelspec soan in /home/ashish/.local/share/jupyter/kernels/soan (soan) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop$ jupyter kernelspec list Available kernels: mongodb /home/ashish/.local/share/jupyter/kernels/mongodb rasa_py38 /home/ashish/.local/share/jupyter/kernels/rasa_py38 soan /home/ashish/.local/share/jupyter/kernels/soan stock_market_prediction /home/ashish/.local/share/jupyter/kernels/stock_market_prediction python3 /home/ashish/anaconda3/envs/soan/share/jupyter/kernels/python3
Tags: Technology,Natural Language Processing,

Social Analysis (SOAN using Python 3) Report

This post (Dated: 21 Apr 2020) is in continuation from this post Social Analysis (SOAN) of WhatsApp Chats (an NLP and Pandas application)
    
The following information could be deciphered from the chat history:

1: Top 20 users based on message count
2: Top 20 Users based on Word Count
3: Line plots for "weekly number of messages by the users" (here, only the top 5 users are shown)
4: Weekly Number of Messages
5: Growth in Total Number of Messages on Weekly Basis
6: Number of messages a user has sent on each week day (Sunday: 326, Monday: 162, Thursday: 264, Friday: 348, Saturday: 405, Tuesday: 534, Wednesday: 157)
7: Line plot for messages sent in each hour
8: Number of messages sent by any user in each hour for the overall period of time [(0, 98), (1, 10), (2, 4), (3, 2), (8, 15), (9, 63), (10, 91), (11, 79), (12, 163), (13, 207), (14, 81), (15, 90), (16, 114), (17, 93), (18, 78), (19, 106), (20, 148), (21, 233), (22, 292), (23, 229)]
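As an illustration of how the first statistic (top users by message count) could be derived, here is a minimal sketch. It is not the SOAN code itself; the file name and the line format "date, time - sender: message" are assumptions about a typical WhatsApp export.

import re
from collections import Counter

# Matches lines like "12/10/19, 9:41 PM - Some User: message text" (assumed format)
pattern = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, [^-]+ - ([^:]+): ")

counts = Counter()
with open("WhatsApp Chat.txt", encoding="utf-8") as f:   # hypothetical export file
    for line in f:
        match = pattern.match(line)
        if match:
            counts[match.group(1)] += 1               # count one message per sender

for user, n in counts.most_common(20):                # top 20 users by message count
    print(user, n)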
Tags: Technology,Natural Language Processing,

Social Analysis (SOAN) of WhatsApp Chats

As of Jan 2020, the SOAN code requires Python 2.7.
We would soon have to migrate this code to Python 3.x because Python 2.7 reached the end of its life on January 1, 2020.

STEP 1: We will create a Python 2.7.16 environment and a kernel for Anaconda as shown in this link "https://survival8.blogspot.com/p/installing-new-kernel-in-jupyter.html"

STEP 2: Install the necessary Python packages.

For this, I am generating a requirements.txt file using the "pip freeze" command in my already set up environment, as shown below:

(py_2716) D:\>pip freeze > requirements.txt 

On the machine that you need to set up, run the following command from the directory containing the "requirements.txt" file.

(py_2716) D:\>pip install -r requirements.txt 

A "pip freeze" of my py_2716 environment shows following packages:

backports-abc==0.5
backports.csv==1.0.7
backports.functools-lru-cache==1.5
backports.shutil-get-terminal-size==1.0.0
beautifulsoup4==4.7.1
certifi==2019.6.16
chardet==3.0.4
cheroot==6.5.5
CherryPy==17.4.2
colorama==0.4.1
contextlib2==0.5.5
cycler==0.10.0
decorator==4.4.0
emoji==0.5.2
enum34==1.1.6
feedparser==5.2.1
future==0.17.1
futures==3.3.0
idna==2.8
ipykernel==4.10.0
ipython==5.8.0
ipython-genutils==0.2.0
jaraco.functools==2.0
jupyter-client==5.3.1
jupyter-core==4.5.0
kiwisolver==1.1.0
lxml==4.3.4
matplotlib==2.2.4
more-itertools==5.0.0
mysqlclient==1.4.2
nltk==3.4.4
numpy==1.16.4
palettable==3.2.0
pandas==0.24.2
pathlib2==2.3.4
Pattern==3.6
pdfminer==20140328
pickleshare==0.7.5
Pillow==6.1.0
portend==2.5
prompt-toolkit==1.0.16
Pygments==2.4.2
pyparsing==2.4.0
python-dateutil==2.8.0
python-docx==0.8.10
pytz==2019.1
pywin32==224
pyzmq==18.0.2
regex==2019.6.8
requests==2.22.0
scandir==1.10.0
scikit-learn==0.20.3
scipy==1.2.2
seaborn==0.9.0
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.12.0
sklearn==0.0
soupsieve==1.9.2
tempora==1.14.1
tornado==5.1.1
traitlets==4.3.2
urllib3==1.25.3
wcwidth==0.1.7
win-unicode-console==0.5
wincertstore==0.2
wordcloud==1.5.0
zc.lockfile==1.4

STEP 3: How did we generate the data?

We have exported a particular group chat for this demo. This was done via the email option in the WhatsApp app. The exported file is a text file "WhatsApp Chat with Cousins (201910-201912).txt".

STEP 4:
The entire code is present on this Google drive link: https://drive.google.com/open?id=1r8oY5BhAlsG0womnWH5awQMdLqSKQebO
Tags: Technology,Natural Language Processing,

Python Packages Useful For Natural Language Processing

1. nltk

The Natural Language Toolkit (NLTK) is a Python package for natural language processing. NLTK requires Python 3.7, 3.8, 3.9 or 3.10. As in:
1.1. from nltk.sentiment.vader import SentimentIntensityAnalyzer
1.2. from nltk.stem import WordNetLemmatizer
1.3. from nltk.corpus import stopwords
1.4. from nltk.tokenize import word_tokenize
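A minimal sketch tying these imports together; the sample sentence and the nltk.download() calls (needed once to fetch the corpora and models) are our additions, not part of the original post.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

for resource in ["vader_lexicon", "punkt", "stopwords", "wordnet", "omw-1.4"]:
    nltk.download(resource, quiet=True)   # one-time downloads

text = "NLTK makes basic NLP tasks quite approachable."
tokens = [t for t in word_tokenize(text.lower()) if t.isalpha()]
tokens = [t for t in tokens if t not in stopwords.words("english")]
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]
print(lemmas)
print(SentimentIntensityAnalyzer().polarity_scores(text))   # VADER sentiment scores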

2. scikit-learn

One of its most popular usages in NLP:
2.1. from sklearn.feature_extraction.text import TfidfVectorizer
2.2. from sklearn.metrics.pairwise import cosine_similarity
2.3. from sklearn.manifold import TSNE
2.4. from sklearn.decomposition import LatentDirichletAllocation, PCA
2.5. from sklearn.cluster import AgglomerativeClustering
Ref: Math with words
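A minimal sketch of the TF-IDF plus cosine-similarity combination from the list above; the three toy documents are our assumption.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock prices fell sharply today",
]
tfidf = TfidfVectorizer().fit_transform(docs)   # sparse matrix, one row per document
print(cosine_similarity(tfidf[0], tfidf))       # similarity of doc 0 to every document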

3. spaCy: Industrial-strength NLP

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pretrained pipelines and currently supports tokenization and training for 60+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management. spaCy is commercial open-source software, released under the MIT license. Our use case involved NER capability of spaCy: Ref: % Exploring Word2Vec % Python Code to Create Annotations For SpaCy NER
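A minimal NER sketch along the lines of our use case; it assumes the small English model has already been installed with "python -m spacy download en_core_web_sm".

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, U.K. GPE, $1 billion MONEY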

4. gensim

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community. Ref: PyPI
Our use case involved:
4.1. from gensim.models import Word2Vec
4.2. from gensim.corpora.dictionary import Dictionary
4.3. from gensim.models.lsimodel import LsiModel, stochastic_svd
4.4. from gensim.models.coherencemodel import CoherenceModel
4.5. from gensim.models.ldamodel import LdaModel  # Latent Dirichlet Allocation and not 'Latent Discriminant Analysis'
4.6. from gensim.models import RpModel
4.7. from gensim.matutils import corpus2dense, Dense2Corpus
4.8. from gensim.test.utils import common_texts
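A minimal Word2Vec sketch using gensim's bundled toy corpus; note this is the gensim 4.x API, where the embedding size parameter is vector_size (it was called size in gensim 3.x).

from gensim.models import Word2Vec
from gensim.test.utils import common_texts

model = Word2Vec(sentences=common_texts, vector_size=50, window=5, min_count=1)
print(model.wv["computer"][:5])                   # first few dimensions of a word vector
print(model.wv.most_similar("computer", topn=3))  # nearest words in the toy corpus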

5. word2vec

Python interface to Google word2vec. Training is done using the original C code; other functionality is pure Python with numpy.

6. GloVe

Cython general implementation of the GloVe multi-threaded training. GloVe is an unsupervised learning algorithm for generating vector representations for words. Training is done using a co-occurrence matrix from a corpus. The resulting representations contain structure useful for many other tasks. The paper describing the model is [here]. The original implementation for this Machine Learning model can be [found here].

7. fastText

fastText is a library for efficient learning of word representations and sentence classification. Ref: % PyPI % Reasoning with Word Vectors
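A minimal unsupervised-training sketch for the fasttext package; 'corpus.txt' is a hypothetical plain-text file with one piece of text per line, and the hyperparameters are illustrative only.

import fasttext

model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)
print(model.get_word_vector("learning")[:5])        # subword-aware word vector
print(model.get_nearest_neighbors("learning", k=3)) # closest words in the trained space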

8. TextWiser: Text Featurization Library

TextWiser (AAAI'21) is a research library that provides a unified framework for text featurization based on a rich set of methods while taking advantage of pretrained models provided by the state-of-the-art libraries. The main contributions include: Rich Set of Embeddings: A wide range of available embeddings and transformations to choose from. Fine-Tuning: Designed to support a PyTorch backend, and hence, retains the ability to fine-tune featurizations for downstream tasks. That means, if you pass the resulting fine-tunable embeddings to a training method, the features will be optimized automatically for your application. Parameter Optimization: Interoperable with the standard scikit-learn pipeline for hyper-parameter tuning and rapid experimentation. All underlying parameters are exposed to the user. Grammar of Embeddings: Introduces a novel approach to design embeddings from components. The compound embedding allows forming arbitrarily complex embeddings in accordance with a context-free grammar that defines a formal language for valid text featurization. GPU Native: Built with GPUs in mind. If it detects available hardware, the relevant models are automatically placed on the GPU. TextWiser is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments. Documentation is available at fidelity.github.io/textwiser. Here is the video of the paper presentation at AAAI 2021. Our Usecase Involved: Document Embeddings (Doc2Vec): Supported by gensim % Defaults to training from scratch Ref: PyPI

9. BERT-As-a-Service

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`
Ref: % Getting Started with BERT-As-a-Service % Word Embeddings Using BERT (Demo of BERT-As-a-Service)
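A minimal client-side sketch; it assumes a bert-serving-start server is already running locally with a downloaded BERT model, which is an assumption on our part.

from bert_serving.client import BertClient

bc = BertClient()                                    # connects to localhost by default
vectors = bc.encode(["First sentence.", "Second sentence."])
print(vectors.shape)                                 # (2, 768) for BERT-Base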

10. transformers

State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow: Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. These models can be applied on: - Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages. - Images, for tasks like image classification, object detection, and segmentation. - Audio, for tasks like speech recognition and audio classification. Transformer models can also perform tasks on several modalities combined, such as table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering. Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments. Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other. Ref: % Word Embeddings using BERT and testing using Word Analogies, Nearest Words, 1D Spectrum % PyPI % conda-forge
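A minimal sketch using the high-level pipeline API; the first call downloads a default pretrained sentiment model from the model hub, and the sample sentence is our own.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes state-of-the-art NLP surprisingly accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]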

11. torch

PyTorch is a Python package that provides two high-level features: % Tensor computation (like NumPy) with strong GPU acceleration % Deep neural networks built on a tape-based autograd system You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch when needed. Ref: PyPI
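A minimal sketch of the two features mentioned above, tensor computation and autograd-based gradients; the tensor values are random.

import torch

x = torch.randn(3, requires_grad=True)   # move to GPU with .to("cuda") if available
y = (x ** 2).sum()
y.backward()                             # autograd fills x.grad with dy/dx = 2x
print(x.grad)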

12. sentence-transformers

Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co. This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in a vector space such that similar text is close and can efficiently be found using cosine similarity. We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases. Further, this framework allows an easy fine-tuning of custom embeddings models, to achieve maximal performance on your specific task. For the full documentation, see www.SBERT.net. The following publications are integrated in this framework: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019) Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (EMNLP 2020) Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks (NAACL 2021) The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (arXiv 2020) TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (arXiv 2021) BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv 2021) Ref: % PyPI % conda-forge
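A minimal sketch; 'all-MiniLM-L6-v2' is a commonly used pretrained model name (downloaded on first use) and the paraphrase pair is our assumption.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["How old are you?", "What is your age?"])
cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(cos)   # close to 1 for paraphrases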

13. Scrapy

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors. Check the Scrapy homepage at https://scrapy.org for more information, including a list of features. Requirements: % Python 3.6+ % Works on Linux, Windows, macOS, BSD
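A minimal spider sketch; quotes.toscrape.com is a public practice site commonly used in Scrapy tutorials. One way to run it: scrapy runspider quotes_spider.py -o quotes.json

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote block into a structured item
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }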

14. Rasa

% Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more % Create chatbots and voice assistants Ref: Rasa

15. Sentiment Analysis using BERT, DistilBERT and ALBERT

Sentiment analysis neural network trained by fine-tuning BERT, ALBERT, or DistilBERT on the Stanford Sentiment Treebank. Ref: % barissayil/SentimentAnalysis % Sentiment Analysis Using BERT

16. pyLDAvis

Python library for interactive topic model visualization. This is a port of the fabulous R package by Carson Sievert and Kenny Shirley. pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization. The visualization is intended to be used within an IPython notebook but can also be saved to a stand-alone HTML file for easy sharing. Note: LDA stands for latent Dirichlet allocation. Ref: % PyPI % Creating taxonomy for BBC news articles
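A minimal sketch combining a gensim LDA model with pyLDAvis, using gensim's toy corpus; note the visualisation module name differs by version (pyLDAvis.gensim_models in 3.x, pyLDAvis.gensim in older releases), and the number of topics is arbitrary here.

from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.test.utils import common_texts
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

dictionary = Dictionary(common_texts)
corpus = [dictionary.doc2bow(text) for text in common_texts]
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3)

vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")   # stand-alone HTML file for sharing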

17. scipy

As in, for cosine distance calculation: from scipy.spatial.distance import cosine
Note:
from sklearn.metrics.pairwise import cosine_similarity  # Expects 2D arrays as input
from scipy.spatial.distance import cosine  # Works with 1D vectors
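A small sketch of the 1D-vs-2D distinction noted above; the vectors are toy values.

import numpy as np
from scipy.spatial.distance import cosine                 # distance, expects 1D vectors
from sklearn.metrics.pairwise import cosine_similarity    # similarity, expects 2D arrays

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(1 - cosine(a, b))                                   # 1 - distance = similarity = 1.0
print(cosine_similarity(a.reshape(1, -1), b.reshape(1, -1)))  # [[1.]]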

18. twitter

The Minimalist Twitter API for Python is a Python API for Twitter, everyone's favorite Web 2.0 Facebook-style status updater for people on the go. Also included is a Twitter command-line tool for getting your friends' tweets and setting your own tweet from the safety and security of your favorite shell and an IRC bot that can announce Twitter updates to an IRC channel. Ref: % https://pypi.org/project/twitter/ % https://survival8.blogspot.com/2022/09/using-twitter-api-to-fetch-trending.html

19. spark-nlp

Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 11000+ pretrained pipelines and models in more than 200+ languages. It also offers tasks such as Tokenization, Word Segmentation, Part-of-Speech Tagging, Word and Sentence Embeddings, Named Entity Recognition, Dependency Parsing, Spell Checking, Text Classification, Sentiment Analysis, Token Classification, Machine Translation (+180 languages), Summarization, Question Answering, Table Question Answering, Text Generation, Image Classification, Automatic Speech Recognition, and many more NLP tasks. Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, CamemBERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, MarianMT, GPT2, and Vision Transformers (ViT) not only to Python and R, but also to JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively. Ref: https://pypi.org/project/spark-nlp/

20. keras-transformer

Popular Usage: Machine Translation Ref: # PyPI # GitHub

21. pronouncing

Pronouncing is a simple interface for the CMU Pronouncing Dictionary. It’s easy to use and has no external dependencies. For example, here’s how to find rhymes for a given word: >>> import pronouncing >>> pronouncing.rhymes("climbing") ['diming', 'liming', 'priming', 'rhyming', 'timing'] Ref: https://pypi.org/project/pronouncing/

22. random-word

This is a simple python package to generate random English words.

23. langdetect

Port of Nakatani Shuyo's language-detection library (version from 03/03/2014) to Python. langdetect supports 55 languages out of the box (ISO 639-1 codes): af, ar, bg, bn, ca, cs, cy, da, de, el, en (English), es, et, fa, fi, fr, gu, he, hi (Hindi), hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw Ref: https://pypi.org/project/langdetect/
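A minimal sketch; pinning DetectorFactory.seed makes the otherwise non-deterministic results repeatable, and the sample sentences are our own.

from langdetect import detect, detect_langs, DetectorFactory

DetectorFactory.seed = 0
print(detect("This is a short English sentence."))   # 'en'
print(detect_langs("यह एक छोटा हिंदी वाक्य है।"))        # e.g. [hi:0.99...]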

24. PyPDF2

PyPDF2 is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. PyPDF2 can retrieve text and metadata from PDFs as well.
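A minimal sketch using the PdfReader/PdfWriter API of PyPDF2 2.x (older releases used PdfFileReader/PdfFileWriter); 'input.pdf' is a hypothetical file.

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
print(len(reader.pages), reader.metadata)      # page count and document metadata
print(reader.pages[0].extract_text())          # text of the first page

writer = PdfWriter()
writer.add_page(reader.pages[0])               # keep only the first page
with open("first_page.pdf", "wb") as f:
    writer.write(f)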

25. python-docx

python-docx is a Python library for creating and updating Microsoft Word (.docx) files. Installation: $ pip install python-docx Usage: import docx Note: Does not support .pdf and .doc Ref: % github % Convert MS Word files into PDF format
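A minimal sketch of creating and reading back a .docx file; the file name and content are illustrative.

from docx import Document

doc = Document()
doc.add_heading("Monthly Report", level=1)
doc.add_paragraph("python-docx can create and update Word documents.")
doc.save("report.docx")

for para in Document("report.docx").paragraphs:   # read the file back
    print(para.text)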

26. emoji

Emoji for Python. This project was inspired by kyokomi. The entire set of Emoji codes as defined by the Unicode consortium is supported in addition to a bunch of aliases. By default, only the official list is enabled but doing emoji.emojize(language='alias') enables both the full list and aliases. Ref: % PyPI % conda-forge % Social Analysis (SOAN using Python 3) Report
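A minimal sketch of emojize/demojize, including the alias behaviour noted above.

import emoji

print(emoji.emojize("Python is :thumbs_up:"))                    # official name
print(emoji.emojize("Python is :thumbsup:", language="alias"))   # alias form
print(emoji.demojize("Python is 👍"))                            # back to :thumbs_up: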

27. pattern

Web mining module for Python. Ref: % PyPI % conda-forge

28. wordcloud

A little word cloud generator in Python. Read more about it on the blog post or the website. The code is tested against Python 2.7, 3.4, 3.5, 3.6 and 3.7. [Dated: 20221005]

Installation with pip3

$ pip3 install wordcloud Collecting wordcloud Downloading wordcloud-1.8.2.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (458 kB) |████████████████████████████████| 458 kB 13 kB/s Requirement already satisfied: numpy>=1.6.1 in /home/ashish/anaconda3/lib/python3.9/site-packages (from wordcloud) (1.21.5) Requirement already satisfied: matplotlib in /home/ashish/anaconda3/lib/python3.9/site-packages (from wordcloud) (3.5.1) Requirement already satisfied: pillow in /home/ashish/anaconda3/lib/python3.9/site-packages (from wordcloud) (9.0.1) Requirement already satisfied: cycler>=0.10 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (0.11.0) Requirement already satisfied: fonttools>=4.22.0 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (4.25.0) Requirement already satisfied: packaging>=20.0 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (21.3) Requirement already satisfied: python-dateutil>=2.7 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (2.8.2) Requirement already satisfied: kiwisolver>=1.0.1 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (1.3.2) Requirement already satisfied: pyparsing>=2.2.1 in /home/ashish/anaconda3/lib/python3.9/site-packages (from matplotlib->wordcloud) (3.0.4) Requirement already satisfied: six>=1.5 in /home/ashish/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0) Installing collected packages: wordcloud Successfully installed wordcloud-1.8.2.2 Ref: % PyPI % conda-forge
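After installation, a minimal usage sketch; the sample text is an assumption.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "whatsapp chat analysis word cloud python pandas emoji sentiment topic"
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
wc.to_file("wordcloud.png")   # or save straight to an image file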

29. Social Analysis (SOAN)

Social Analysis based on Whatsapp data Ref: GitHub
Tags: Technology,Natural Language Processing,

Monday, October 3, 2022

The RulePolicy of Rasa (Oct 2022)

You would come across the RulePolicy configuration of Rasa if you ever land up on this Rasa documentation page: https://rasa.com/docs/rasa/policies/

The page for Forms says:

Usage

To use forms with Rasa Open Source you need to make sure that the Rule Policy is added to your policy configuration. For example:
policies:
  - name: RulePolicy

To see the lack of proper logging and reproduce the erroneous situation that this policy can land you in, do the following. Do a "$ rasa init" to initialize a project. Add the following code to the "config.yml":

policies:
  - name: RulePolicy

Next, do:

$ rasa train
$ rasa interactive

Next, we are going to show chat logs 'with the RulePolicy configuration' and 'without the RulePolicy configuration'.

Without the RulePolicy configuration (this is the default setting you get from '$ rasa init')

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop/ws/jupyter/rasa/project2$ rasa interactive 2022-10-04 00:09:02 INFO numexpr.utils - NumExpr defaulting to 4 threads. The configuration for policies and pipeline was chosen automatically. It was written into the config file at 'config.yml'. /home/ashish/.local/lib/python3.8/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. if LooseVersion(module.__version__) < minver: /home/ashish/anaconda3/envs/rasa_py38/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other) /home/ashish/.local/lib/python3.8/site-packages/tensorflow_addons/utils/ensure_tf_install.py:47: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. min_version = LooseVersion(INCLUSIVE_MIN_TF_VERSION) 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'CountVectorsFeaturizer' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'CountVectorsFeaturizer' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'DIETClassifier' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'EntitySynonymMapper' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'LexicalSyntacticFeaturizer' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'MemoizationPolicy' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'RegexFeaturizer' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'ResponseSelector' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'RulePolicy' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'TEDPolicy' from cache. 2022-10-04 00:09:09 INFO rasa.engine.training.hooks - Restored component 'UnexpecTEDIntentPolicy' from cache. Your Rasa model is trained and saved at 'models/20221004-000906-cream-hoops.tar.gz'. /home/ashish/.local/lib/python3.8/site-packages/colorclass/codes.py:5: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working from collections import Mapping /home/ashish/.local/lib/python3.8/site-packages/sanic_cors/extension.py:39: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. SANIC_VERSION = LooseVersion(sanic_version) 2022-10-04 00:09:24 INFO rasa.core.processor - Loading model models/20221004-000906-cream-hoops.tar.gz... 2022-10-04 00:09:51 WARNING rasa.shared.utils.common - The UnexpecTED Intent Policy is currently experimental and might change or be removed in the future 🔬 Please share your feedback on it in the forum (https://forum.rasa.com) to help us make this feature ready for production. 2022-10-04 00:10:04 INFO root - Rasa server is up and running. 
Processed story blocks: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1446.98it/s, # trackers=1] Processed rules: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2095.58it/s, # trackers=1] Bot loaded. Visualisation at http://localhost:5006/visualization.html . Type a message and press enter (press 'Ctrl-c' to exit). ? Your input -> hi ? Your NLU model classified 'hi' with intent 'greet' and there are no entities, is this correct? Yes ------ Chat History # Bot You ──────────────────────────────────────────── 1 action_listen ──────────────────────────────────────────── 2 hi intent: greet 1.00 Current slots: session_started_metadata: None ------ ? The bot wants to run 'utter_greet', correct? Yes /home/ashish/.local/lib/python3.8/site-packages/rasa/server.py:860: FutureWarning: The "POST /conversations/<conversation_id>/execute" endpoint is deprecated. Inserting actions to the tracker externally should be avoided. Actions should be predicted by the policies only. rasa.shared.utils.io.raise_warning( ------ Chat History # Bot You ──────────────────────────────────────────────── 1 action_listen ──────────────────────────────────────────────── 2 hi intent: greet 1.00 ──────────────────────────────────────────────── 3 utter_greet 1.00 Hey! How are you? ------ Current slots: session_started_metadata: None ? The bot wants to run 'action_listen', correct? Yes ? Your input -> great ? Your NLU model classified 'great' with intent 'mood_great' and there are no entities, is this correct? Yes ------ Chat History ------ # Bot You ────────────────────────────────────────────────────── 1 action_listen ────────────────────────────────────────────────────── 2 hi intent: greet 1.00 ────────────────────────────────────────────────────── 3 utter_greet 1.00 Hey! How are you? action_listen 1.00 ────────────────────────────────────────────────────── 4 great intent: mood_great 1.00 Current slots: session_started_metadata: None ? The bot wants to run 'utter_happy', correct? Yes ------ Chat History # Bot You ────────────────────────────────────────────────────── 1 action_listen ────────────────────────────────────────────────────── 2 hi intent: greet 1.00 ────────────────────────────────────────────────────── 3 utter_greet 1.00 Hey! How are you? action_listen 1.00 ────────────────────────────────────────────────────── 4 great intent: mood_great 1.00 ────────────────────────────────────────────────────── 5 utter_happy 1.00 Great, carry on! ------ Current slots: session_started_metadata: None ? The bot wants to run 'action_listen', correct? Yes ? Your input ->

Logs with the RulePolicy configuration

Erroneous Logs (Part 1)

(base) ashish@ashish-Lenovo-ideapad-130-15IKB:~/Desktop/ws/jupyter/rasa/project2$ rasa interactive 2022-10-03 23:35:00 INFO numexpr.utils - NumExpr defaulting to 4 threads. The configuration for pipeline was chosen automatically. It was written into the config file at 'config.yml'. /home/ashish/.local/lib/python3.8/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. if LooseVersion(module.__version__) < minver: /home/ashish/anaconda3/envs/rasa_py38/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other) /home/ashish/.local/lib/python3.8/site-packages/tensorflow_addons/utils/ensure_tf_install.py:47: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. min_version = LooseVersion(INCLUSIVE_MIN_TF_VERSION) 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'CountVectorsFeaturizer' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'CountVectorsFeaturizer' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'DIETClassifier' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'EntitySynonymMapper' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'LexicalSyntacticFeaturizer' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'RegexFeaturizer' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'ResponseSelector' from cache. 2022-10-03 23:35:08 INFO rasa.engine.training.hooks - Restored component 'RulePolicy' from cache. Your Rasa model is trained and saved at 'models/20221003-233504-energetic-palace.tar.gz'. /home/ashish/.local/lib/python3.8/site-packages/colorclass/codes.py:5: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working from collections import Mapping /home/ashish/.local/lib/python3.8/site-packages/sanic_cors/extension.py:39: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. SANIC_VERSION = LooseVersion(sanic_version) 2022-10-03 23:35:20 INFO rasa.core.processor - Loading model models/20221003-233504-energetic-palace.tar.gz... 2022-10-03 23:35:33 INFO root - Rasa server is up and running. Processed story blocks: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1607.63it/s, # trackers=1] Processed rules: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2113.53it/s, # trackers=1] Bot loaded. Visualisation at http://localhost:5006/visualization.html . Type a message and press enter (press 'Ctrl-c' to exit). ? Your input -> hi ? Your NLU model classified 'hi' with intent 'greet' and there are no entities, is this correct? Yes ------ Chat History # Bot You ──────────────────────────────────────────── 1 action_listen ──────────────────────────────────────────── 2 hi intent: greet 1.00 ------ Current slots: session_started_metadata: None ? The bot wants to run 'action_default_fallback', correct? (Y/n) Cancelled by user ? Do you want to stop? 
(Use arrow keys) Cancelled by user ? Export stories to (if file exists, this will append the stories) data/stories.yml Cancelled by user 2022-10-03 23:36:09 INFO rasa.core.training.interactive - Killing Sanic server now.

Erroneous Logs (Part 2)

? Your input -> hi ? Your NLU model classified 'hi' with intent 'greet' and there are no entities, is this correct? Yes ------ Chat History # Bot You ──────────────────────────────────────────── 1 action_listen ──────────────────────────────────────────── 2 hi intent: greet 1.00 Current slots: session_started_metadata: None ------ ? The bot wants to run 'action_default_fallback', correct? Yes /home/ashish/.local/lib/python3.8/site-packages/rasa/server.py:860: FutureWarning: The "POST /conversations/<conversation_id>/execute" endpoint is deprecated. Inserting actions to the tracker externally should be avoided. Actions should be predicted by the policies only. rasa.shared.utils.io.raise_warning( ------ Chat History # Bot You ──────────────────────────────────── 1 action_listen ------ Current slots: session_started_metadata: None ? The bot wants to run 'action_default_fallback', correct? Yes ------ Chat History ------ # Bot You ──────────────────────────────────── 1 action_listen Current slots: session_started_metadata: None ? The bot wants to run 'action_default_fallback', correct? Yes ------

What is this "action_default_fallback"?

The answer comes from Rasa Documentation:

Policy Priority

In the case that two policies predict with equal confidence (for example, the Memoization and Rule Policies might both predict with confidence 1), the priority of the policies is considered. Rasa Open Source policies have default priorities that are set to ensure the expected outcome in the case of a tie. They look like this, where higher numbers have higher priority:
6 - RulePolicy
3 - MemoizationPolicy or AugmentedMemoizationPolicy
2 - UnexpecTEDIntentPolicy
1 - TEDPolicy

Rule Policy

The RulePolicy is a policy that handles conversation parts that follow a fixed behavior (e.g. business logic). It makes predictions based on any rules you have in your training data. See the Rules documentation for further information on how to define rules. The RulePolicy has the following configuration options:

File: config.yml

policies:
  - name: "RulePolicy"
    core_fallback_threshold: 0.3
    core_fallback_action_name: action_default_fallback
    enable_fallback_prediction: true
    restrict_rules: true
    check_for_contradictions: true

The values shown above are the default settings. Ref: https://rasa.com/docs/rasa/policies
Tags: Technology,Natural Language Processing,Rasa,

6 Labeled Datasets For Sentiment Analysis

Download Code and Data

import pandas as pd
import seaborn as sns


1. Amazon Reviews

amazon_reviews = pd.read_csv('input/amazonReviewSnippets_GroundTruth.txt', sep = '\t')
amazon_reviews['dataset'] = 'amazon'

def get_sentiment_label(sentiment_score):
    if (sentiment_score < 0):
        return 'Negative'
    else:
        return 'Positive'

amazon_reviews['sentiment_label'] = amazon_reviews['sentiment'].apply(get_sentiment_label)
amazon_reviews['length'] = amazon_reviews['text'].apply(len)

def get_word_count(text):
    text = text.split()
    return len(text)

amazon_reviews['word_count'] = amazon_reviews['text'].apply(get_word_count)
amazon_reviews.head()
sns.countplot(x ='sentiment_label', data = amazon_reviews)
amazon_reviews['word_count'].describe() count 3546.000000 mean 17.300056 std 31.449383 min 1.000000 25% 9.000000 50% 15.000000 75% 21.000000 max 1220.000000 Name: word_count, dtype: float64

If the maximum number of tokens in a text exceeds 512, a plain BERT embedding cannot be used and we have to use SentenceBERT as the embedding technique.

2. Movie Reviews

movie_reviews = pd.read_csv('input/movieReviewSnippets_GroundTruth.txt', sep = '\t')
movie_reviews['dataset'] = 'movie reviews'
movie_reviews['sentiment_label'] = movie_reviews['sentiment'].apply(get_sentiment_label)
movie_reviews['word_count'] = movie_reviews['text'].apply(get_word_count)
movie_reviews.head(5)
sns.countplot(x ='sentiment_label', data = movie_reviews)
movie_reviews['word_count'].describe() count 10605.000000 mean 18.864875 std 8.702398 min 1.000000 25% 12.000000 50% 18.000000 75% 25.000000 max 51.000000 Name: word_count, dtype: float64

3. New York Editorial Snippets

nyt_editorial_snippets = pd.read_csv('input/nytEditorialSnippets_GroundTruth.txt', sep = '\t')
nyt_editorial_snippets['dataset'] = 'nyt_editorial_snippets'
nyt_editorial_snippets['sentiment_label'] = nyt_editorial_snippets['sentiment'].apply(get_sentiment_label)
nyt_editorial_snippets['word_count'] = nyt_editorial_snippets['text'].apply(get_word_count)
nyt_editorial_snippets.head()
sns.countplot(x ='sentiment_label', data = nyt_editorial_snippets)
nyt_editorial_snippets['word_count'].describe() count 5183.000000 mean 17.482925 std 8.767046 min 1.000000 25% 11.000000 50% 17.000000 75% 23.000000 max 91.000000 Name: word_count, dtype: float64

4. General Twitter Data (Tweets)

tweets_groud_truth = pd.read_csv('input/tweets_GroundTruth.txt', sep = '\t')
tweets_groud_truth['dataset'] = 'tweets_groud_truth'
tweets_groud_truth['sentiment_label'] = tweets_groud_truth['sentiment'].apply(get_sentiment_label)
tweets_groud_truth['word_count'] = tweets_groud_truth['text'].apply(get_word_count)
tweets_groud_truth.head()
sns.countplot(x ='sentiment_label', data = tweets_groud_truth)
tweets_groud_truth['word_count'].describe() count 4200.000000 mean 13.619286 std 6.720463 min 1.000000 25% 8.000000 50% 13.000000 75% 19.000000 max 32.000000 Name: word_count, dtype: float64

5. US Presidential Election of 2016

us_presidential_election_2016 = pd.read_csv('input/us_politics_presidential_election_2016.csv', sep = ',')
us_presidential_election_2016 = us_presidential_election_2016[['id', 'sentiment', 'text']]
us_presidential_election_2016['dataset'] = 'us_presidential_election_2016'
us_presidential_election_2016.head()
sns.countplot(x ='sentiment', data = us_presidential_election_2016)
us_presidential_election_2016['word_count'] = us_presidential_election_2016['text'].apply(get_word_count)
us_presidential_election_2016['word_count'].describe()
count 13871.000000 mean 16.943912 std 5.224908 min 2.000000 25% 13.000000 50% 18.000000 75% 21.000000 max 29.000000 Name: word_count, dtype: float64

6. Stock Market Related Tweets

stock_market_tweets = pd.read_csv('input/stock_market_twitter_data.csv')
stock_market_tweets['sentiment_label'] = stock_market_tweets['Sentiment'].apply(get_sentiment_label)
stock_market_tweets['word_count'] = stock_market_tweets['Text'].apply(get_word_count)
stock_market_tweets.head()
sns.countplot(x ='sentiment_label', data = stock_market_tweets)
stock_market_tweets['word_count'].describe() count 5791.000000 mean 14.006562 std 6.595463 min 2.000000 25% 9.000000 50% 14.000000 75% 19.000000 max 32.000000 Name: word_count, dtype: float64
Tags: Technology,Natural Language Processing,