Comparing Node.js, Python, PHP, Ruby, Go, and Perl on processing a single 400 MB CSV file

zhouyin
219 days ago

### Timing

Perl: the slowest; I couldn't wait for it to finish and killed it.

Node.js: a bit over 1 minute

PHP: 30-odd seconds

Ruby: 30-odd seconds

Python: about 11 seconds

Go: about 4 seconds

### On speed, Go and Python win

### On functionality: the CSV file is non-standard, with a single stray double quote in one field

Go, Node.js, and Ruby all throw errors and cannot finish; their timings above were measured on a copy of the file with that stray double quote removed.

PHP doesn't raise an error, but because of that lone double quote it silently drops a lot of rows: it treats those quotes as field boundaries.

Functionally, Python wins. Python handles the non-standard CSV without complaint and ends up producing a correct CSV, in just a few lines of code.
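
For what it's worth, the csv-parse library used later in this thread does document a relax_quotes option aimed at exactly this case, a lone double quote inside an unquoted field (older releases call it relax). Below is a hypothetical sketch with placeholder paths; it is not the code used for the timings above and has not been tested against this particular file.

```javascript
// Hypothetical sketch, not the code behind the timings above: tolerate a
// stray double quote inside an unquoted field and re-emit standard CSV.
// relax_quotes is the csv-parse v5 option name; older releases call it "relax".
const { createReadStream, createWriteStream } = require("fs");
const { parse } = require("csv-parse");

const parser = parse({ relax_quotes: true });
const out = createWriteStream("./fixed.csv"); // placeholder output path

createReadStream("../outpy.csv").pipe(parser);

parser.on("data", (row) => {
  // Quote every field and double any embedded quotes so the output is valid CSV.
  out.write(row.map((f) => `"${String(f).replace(/"/g, '""')}"`).join(",") + "\n");
});
parser.on("end", () => out.end());
parser.on("error", (err) => console.error("CSV Parsing Error:", err));
```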

### Writing the code, Node.js was the most unpleasant

What is Node.js so full of itself about? It reminds me of the Ghostscript author's verdict on Perl: that Perl looks like something spat out of a dog's rear end.

After writing this little project, I feel it's Node.js that fits that description.

1767 views
Posted in node: 分享发现 (Share Discoveries)
23 replies
ysc3839
219 days ago
So where's the code?
zhouyin
219 days ago
I can't upload the code here.

See it here:

https://cowtransfer.com/s/f0a48d2009fd4f
zhouyin
219 days ago
hefish
219 days ago
Haha, what a high-level analysis.
gainsurier
219 days ago
Would it even take a second written in C, I wonder?
zhouyin
219 days ago
@gainsurier
Aren't Python, PHP, and Ruby all implemented in C anyway? Python just has the better implementation.
chenqh
219 days ago
Why is Python that fast? Is it a C library underneath?
chenqh
219 days ago
Wait, why is Node.js this slow? Where did its JIT go? Slower than PHP and Ruby, which don't even have a JIT?
zhouyin
219 days ago
@gainsurier
And Node.js is implemented in C++, just not done as well as Python.
henbf
219 days ago
Before ranting about Node.js, maybe first ask yourself whether you've got the basic concepts of I/O and streams straight.
zhouyin
219 days ago
@henbf
I'm no Node.js expert. I updated a.js to use an output stream, but now it dies with a heap out-of-memory error:

```bash
-bash-4.2# node a.js
(node:17974) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added to [WriteStream]. Use emitter.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
(node:17974) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added to [WriteStream]. Use emitter.setMaxListeners() to increase limit
(node:17974) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added to [WriteStream]. Use emitter.setMaxListeners() to increase limit

<--- Last few GCs --->

[17974:0x1c3dbf0] 40306 ms: Scavenge (reduce) 2046.8 (2082.1) -> 2046.5 (2082.6) MB, 44.4 / 0.0 ms (average mu = 0.342, current mu = 0.316) allocation failure
[17974:0x1c3dbf0] 40396 ms: Scavenge (reduce) 2047.2 (2082.6) -> 2046.8 (2082.8) MB, 31.1 / 0.0 ms (average mu = 0.342, current mu = 0.316) allocation failure


<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
1: 0x7fcfb6136908 node::Abort() [/lib64/libnode.so.93]
2: 0x7fcfb6024451 [/lib64/libnode.so.93]
3: 0x7fcfb732a552 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/lib64/libnode.so.93]
4: 0x7fcfb732a8e7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/lib64/libnode.so.93]
5: 0x7fcfb74ea305 [/lib64/libnode.so.93]
6: 0x7fcfb74ea3e5 [/lib64/libnode.so.93]
7: 0x7fcfb74fe77c v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/lib64/libnode.so.93]
8: 0x7fcfb74ff0a1 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/lib64/libnode.so.93]
9: 0x7fcfb7502269 v8::internal::Heap::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
10: 0x7fcfb75022f7 v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
11: 0x7fcfb74c27d0 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
12: 0x7fcfb74badb4 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [/lib64/libnode.so.93]
13: 0x7fcfb74bcbdf v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString(int, v8::internal::AllocationType) [/lib64/libnode.so.93]
14: 0x7fcfb74c4d5d v8::internal::Factory::NewStringFromUtf8(v8::base::Vector<char const> const&, v8::internal::AllocationType) [/lib64/libnode.so.93]
15: 0x7fcfb733d59d v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::NewStringType, int) [/lib64/libnode.so.93]
16: 0x7fcfb6215390 node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local<v8::Value>*) [/lib64/libnode.so.93]
17: 0x7fcfb6123ef3 [/lib64/libnode.so.93]
18: 0x7fcfb71ba3cc [/lib64/libnode.so.93]
Aborted
```
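
The MaxListenersExceededWarning in the log above hints at the likely cause: a new "drain" listener appears to be registered for every write() call that returns false, while rows keep piling up in memory. The usual pattern is to wait for a single "drain" before writing more. A minimal sketch of that pattern, as a plain line copy with placeholder paths (not the actual a.js, and with CSV parsing left out):

```javascript
// Hypothetical sketch of stream backpressure handling, not the OP's a.js:
// when write() returns false, await one "drain" event instead of piling
// up listeners, so the writable buffer never grows without bound.
const { createReadStream, createWriteStream } = require("fs");
const readline = require("readline");

async function copyLines(inputPath, outputPath) {
  const out = createWriteStream(outputPath);
  const rl = readline.createInterface({
    input: createReadStream(inputPath),
    crlfDelay: Infinity,
  });
  for await (const line of rl) {
    if (!out.write(line + "\n")) {
      // Back off until the write buffer has been flushed.
      await new Promise((resolve) => out.once("drain", resolve));
    }
  }
  out.end();
}

copyLines("../outpy.csv", "./test.txt").then(() => console.log("finished"));
```
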
henbf
219 days ago
@zhouyin Your code isn't written correctly:

```javascript
const { createReadStream, createWriteStream } = require("fs");
const { parse } = require("csv-parse");

const inputPath = "../outpy.csv";
const outputPath = "./test.txt";

const readStream = createReadStream(inputPath);
const writeStream = createWriteStream(outputPath, { flags: "a" });

const parser = parse({ delimiter: ",", from_line: 2 });

readStream.pipe(parser);

parser.on("data", (row) => {
  writeStream.write(row.join(",") + "\n");
});

parser.on("end", () => {
  console.log("finished");
  writeStream.end();
});

parser.on("error", (error) => {
  console.error("CSV Parsing Error:", error);
});
```
zhouyin
219 days ago
That's pretty much how I wrote it at first, but the speed didn't improve, so I changed it to that other version, figuring write needed its own buffering.

I ran your code verbatim, without changing a single character. Result: over a minute, nowhere near Python.

```bash
-bash-4.2# time node a.js
finished

real 1m3.579s
user 1m4.103s
sys 0m2.478s
```
henbf
219 days ago
@zhouyin It also depends on what you do to each CSV row in between. Your Python version just reads and writes with no extra processing, which amounts to a copy. In Node.js you convert every row into an array and then turn the array back into a string when writing; of course it's slower.

```javascript
const { createReadStream, createWriteStream } = require("fs");

const inputPath = "../outpy.csv";
const outputPath = "./test.txt";

const readStream = createReadStream(inputPath, { highWaterMark: 256 * 1024 });
const writeStream = createWriteStream(outputPath, { flags: "a" });

readStream.pipe(writeStream);

readStream.on("end", () => {
  console.log("finished");
  writeStream.end();
});

readStream.on("error", (err) => {
  console.error("Error reading file:", err);
});

writeStream.on("error", (err) => {
  console.error("Error writing file:", err);
});
```
zhouyin
219 days ago
@henbf
Python also hands back each row as an array; what gets written is an array too.
zhouyin
219 days ago
@henbf

I also tried another library, csvwriter, and it was unbearably slow.

Python's libraries are just well designed; you have to hand it to them.
zhouyin
219 days ago
@zhouyin
With csvwriter it took over 3 minutes:

```bash
-bash-4.2# time node a.js
finished

real 3m45.028s
user 4m12.751s
sys 2m59.847s
```
henbf
219 days ago
@zhouyin ✅✅✅ Node.js just isn't suited to parsing CSV; Python is amazing.
stabc
219 days ago
1. Parsing a CSV means splitting and joining character by character, which is where low-level languages have an absolute advantage: they can work straight off byte positions, while Node creates a new string object every time.

2. Python's standard library already ships a csv module, so the parsing also runs at the native level; that it is still so much slower than Go suggests the implementation isn't all that good.

3. I just ran a quick test: if you optimize the parsing in Node and cut down on string concatenation, a 400 MB CSV file can be parsed in under 5 seconds total (a rough sketch of the idea follows below).
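
A rough sketch of the idea in point 3, with a placeholder path: scan the raw buffers for newline bytes and slice fields by index only when they are actually needed, instead of building strings character by character. Quoting is ignored, so this only illustrates the approach rather than being a complete CSV parser, and the sub-5-second figure above is the commenter's own measurement, not verified here.

```javascript
// Hypothetical sketch: count rows of a CSV-ish file by scanning buffers
// for newline bytes, without creating a string per character or per field.
// Fields would be sliced by index (buf.subarray) only when actually needed.
const { createReadStream } = require("fs");

async function countRows(path) {
  let rows = 0;
  let leftover = Buffer.alloc(0);
  const stream = createReadStream(path, { highWaterMark: 1024 * 1024 });
  for await (const chunk of stream) {
    const buf = leftover.length ? Buffer.concat([leftover, chunk]) : chunk;
    let start = 0;
    let nl;
    while ((nl = buf.indexOf(0x0a, start)) !== -1) { // 0x0a is "\n"
      rows++;
      start = nl + 1;
    }
    leftover = buf.subarray(start); // carry the partial last line into the next chunk
  }
  if (leftover.length) rows++; // file may not end with a newline
  return rows;
}

countRows("../outpy.csv").then((n) => console.log("rows:", n));
```
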
gesse
219 days ago
@henbf Hahaha
