Stop Using CSVs for Storage — Here Are the Top 5 Alternatives

  • CSVs cost you time, disk space, and money. Here are five alternatives every data scientist must know.
  • Everyone and their grandmother know what a CSV file is. But is it the optimal way to store data? Heck no. It’s probably the worst storage format if you don’t plan to view or edit data on the fly.
  • If you’re storing large volumes of data, opting for CSVs will cost you both time and money.
  • Today you’ll learn about five CSV alternatives. Each provides an advantage, either in read/write time or in file size. Some are even better in all areas.
  • Let’s set up the environment before going over the file formats.
  • Getting started — Environment setup
  • You’ll need a couple of libraries to follow along. The best practice is to install them inside a virtual environment, so that’s exactly what you’ll do. The following code snippet creates a new virtual environment through Anaconda and installs every required library:
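  • The snippet itself didn’t survive in this copy, so here’s a minimal reconstruction. The environment name file_formats is an assumption, and the library list simply matches what the article uses later:

    # Create and activate a fresh Anaconda environment (name is assumed)
    conda create -n file_formats python=3.9 -y
    conda activate file_formats
    # Install the libraries used throughout the article
    conda install -c conda-forge pandas pyarrow fastavro feather-format jupyterlab -y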
  • Once the installation finishes, you can execute the following command to start a JupyterLab session:
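  • The command is missing from this copy; starting a session is a one-liner:

    jupyter lab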
  • The next step is to import the libraries and create an arbitrary dataset. You’ll make one with 5 columns and 10M rows:
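  • The original snippet is also missing. A minimal sketch that builds a comparable frame (the column names and random-float contents are assumptions, not the article’s exact data):

    import numpy as np
    import pandas as pd

    np.random.seed(42)

    # 10M rows x 5 columns of random floats
    df = pd.DataFrame(
        np.random.rand(10_000_000, 5),
        columns=['a', 'b', 'c', 'd', 'e'],
    )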
  • Here’s what it looks like:
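  • The preview image was lost, but you can inspect the first rows yourself:

    df.head()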
  • You now have everything needed to start experimenting with different data formats. Let’s cover ORC first.
  • ORC
  • ORC stands for Optimized Row Columnar. It’s a data format optimized for reads and writes in Hive. As Hive is painfully slow, folks at Hortonworks decided to develop the ORC file format to speed it up.
  • In Python, you can use the read_orc() function from Pandas to read ORC files. Unfortunately, there’s no alternative function for writing ORC files, so you’ll have to use PyArrow.
  • Here’s an example of writing Pandas DataFrames:
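  • The snippet is missing here; a sketch using PyArrow (the file name is arbitrary, and pyarrow.orc.write_table assumes a reasonably recent PyArrow version):

    import pyarrow as pa
    import pyarrow.orc as orc

    # Convert the DataFrame to an Arrow table, then write it out as ORC
    table = pa.Table.from_pandas(df, preserve_index=False)
    orc.write_table(table, 'data_10m.orc')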
  • And here’s the command for reading ORC files:
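  • Reconstructed with Pandas’ built-in reader:

    df_orc = pd.read_orc('data_10m.orc')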
  • You can learn more about ORC here:
  • Avro
  • Avro is an open-source project which provides services of data serialization and exchange for Apache Hadoop. It stores a JSON-like schema with the data, so the correct data types are known in advance. That’s where the compression happens.
  • Avro has an API for every major programming language, but it doesn’t support Pandas by default.
  • Here’s the set of commands for saving a Pandas DataFrame to an Avro file:
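  • The commands are missing from this copy; a sketch with fastavro, where the schema has to be declared by hand (five 'double' fields matching the frame above is an assumption):

    from fastavro import writer, parse_schema

    # Hand-written Avro schema: one 'double' field per DataFrame column
    schema = parse_schema({
        'doc': 'Benchmark dataset',
        'name': 'Data',
        'namespace': 'data',
        'type': 'record',
        'fields': [{'name': col, 'type': 'double'} for col in df.columns],
    })

    # Serialize the rows as a list of records
    with open('data_10m.avro', 'wb') as f:
        writer(f, schema, df.to_dict('records'))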
  • Reading Avro files is no picnic either:
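  • A matching read sketch, again via fastavro, collecting the records back into a DataFrame:

    from fastavro import reader

    with open('data_10m.avro', 'rb') as f:
        df_avro = pd.DataFrame(list(reader(f)))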
  • You can learn more about Avro here:
  • Parquet
  • Apache Parquet is a data storage format designed for efficiency. The reason behind this is its columnar storage architecture, which lets you quickly skip over irrelevant data. This way, both queries and aggregations are faster, resulting in hardware savings.
  • The best news is — Pandas has full support for Parquet files.
  • Here’s the command for writing a Pandas DataFrame to a Parquet file:
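  • Reconstructed (the file name is arbitrary; Pandas delegates the work to PyArrow):

    df.to_parquet('data_10m.parquet')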
  • And here’s the equivalent for reading:
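  • And the reconstructed reader:

    df_parquet = pd.read_parquet('data_10m.parquet')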
  • You can learn more about Parquet here:
  • Pickle
  • You can use the pickle module to serialize objects and save them to a file. Likewise, you can then deserialize the serialized file to load them back when needed. Pickle has one major advantage over other formats — you can use it to store any Python object. One of the most widely used functionalities is saving machine learning models after the training is complete.
  • The biggest downside is that Pickle is Python-specific, so cross-language support isn’t guaranteed. It could be a deal-breaker for any project requiring data communication between Python and R, for example.
  • Here’s how to write a Pandas DataFrame to a Pickle file:
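  • Reconstructed with the standard-library pickle module (file name arbitrary):

    import pickle

    # 'wb': write in binary mode
    with open('data_10m.pickle', 'wb') as f:
        pickle.dump(df, f)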
  • You’ll only have to change the file mode when reading a Pickle file:
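  • The mode flips to 'rb' and dump() becomes load():

    # 'rb': read in binary mode
    with open('data_10m.pickle', 'rb') as f:
        df_pickle = pickle.load(f)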
  • You can learn more about Pickle here:
  • Feather
  • Feather is a data format for storing data frames. It’s designed around a simple premise — to push data frames in and out of memory as efficiently as possible. It was initially designed for fast communication between Python and R, but you’re not limited to this use case.
  • You can use the feather library to work with Feather files in Python. It’s the fastest available option currently.
  • Here’s the command for saving Pandas DataFrames to a Feather file:
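  • Reconstructed with the feather-format package (file name arbitrary):

    import feather

    feather.write_dataframe(df, 'data_10m.feather')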
  • And here’s the command for reading:
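  • And the matching reader:

    df_feather = feather.read_dataframe('data_10m.feather')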
  • You can learn more about Feather here:
  • Comparison time — Which data format should you use?
  • Many highly optimized file formats are useless if you need to change or even view the data on the fly. If that’s not the case, you should generally avoid CSVs.
  • Below is a comparison between CSV and every other mentioned data format in write time. The goal was to save the previously created 10M×5 dataset locally:
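  • The benchmark chart was lost in this copy. As a rough sketch of how such numbers can be reproduced (single runs; absolute timings vary heavily by machine), using the write calls from the previous sections:

    import time

    def time_write(label, write_fn):
        # Time a single write of the 10M-row frame to local disk
        start = time.perf_counter()
        write_fn()
        print(f'{label}: {time.perf_counter() - start:.2f}s')

    time_write('CSV', lambda: df.to_csv('data_10m.csv', index=False))
    time_write('Parquet', lambda: df.to_parquet('data_10m.parquet'))
    time_write('Feather', lambda: feather.write_dataframe(df, 'data_10m.feather'))

  • The same pattern, with the read functions from the sections above, reproduces the read-time comparison discussed next.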
  • The differences are astronomical. Feather is about 115 times faster than CSV for storing identical datasets. Even if you decide to go with something more compatible such as Parquet, there’s still a 17 times decrease in write time.
  • Let’s talk about the read times next. The goal is to compare how much time it takes to read identical datasets in different formats:
  • CSVs aren’t so terrible here. Apache Avro is the absolute worst due to the required parsing. Pickle is the fastest, so it looks like the most promising option if you’re working only in Python.
  • And finally, let’s compare the file sizes on the disk:
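  • The size chart is missing as well; the raw numbers are easy to collect (the paths match the sketches above):

    import os

    # Print the on-disk size of every file written so far, in megabytes
    for path in ['data_10m.csv', 'data_10m.orc', 'data_10m.avro',
                 'data_10m.parquet', 'data_10m.pickle', 'data_10m.feather']:
        if os.path.exists(path):
            print(f'{path}: {os.path.getsize(path) / 1024 ** 2:.1f} MB')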
  • Things don’t look good for CSVs. The file size reduction goes from 2.4x to 4.8x, depending on the file format.
  • To summarize, if you store gigabytes of data daily, choosing the correct file format is crucial. If you’re working only in Python, you can’t go wrong with Pickle. If you need something a bit more versatile, go with any other mentioned format.
  • What are your thoughts on these CSV alternatives? Which one(s) do you use if viewing and editing data on the fly isn’t required? Let me know in the comments below.
  • Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, with no extra cost to you.
  • Stay connected
  • Follow me on Medium for more stories like this
  • Sign up for my newsletter
  • Connect on LinkedIn
