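A newbie asking for pandas help again

Hi all, I have two DataFrames that I need to merge horizontally (column-wise). I asked a free ChatGPT wrapper and ERNIE Bot (文心一言), and neither could solve it. I'd really appreciate it if someone could take a look.

df1 looks like this: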

	id	regions	isp	answers
1	1	广东	电信	xxx.xxx.com. xxx.xxx.xxx.com. 1.1.1.1 中国深圳电信 2.2.2.2 中国深圳电信
2	2	上海	电信	xxx.xxx.com. xxx.xxx.xxx.com. 3.3.3.3 中国上海电信 4.4.4.4 中国上海电信

df2 looks like this:

Content-Type	Content-Length	Connection	Accept-Ranges	Age	ip	status_code
text/plain	15310871	keep-alive	bytes	13	1.1.1.1	200
text/plain	15310871	keep-alive	bytes	0	2.2.2.2	403
text/plain	4668490	keep-alive	bytes	20	3.3.3.3	200
text/plain	15310871	keep-alive	bytes	25	4.4.4.4	200

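I want to merge them into this (the full table is too wide to read comfortably, so I dropped some of the middle columns when editing this V2EX post):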

answers	ip	Content-Length	Age	status_code
xxx.xxx.com. xxx.xxx.xxx.com. 1.1.1.1 中国深圳电信 2.2.2.2 中国深圳电信	1.1.1.1 2.2.2.2	15310871 15310871	13 0	200 403
xxx.xxx.com. xxx.xxx.xxx.com. 3.3.3.3 中国上海电信 4.4.4.4 中国上海电信	3.3.3.3 4.4.4.4	4668490 15310871	20 25	200 200

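The merge rule: if a value in df1's answers column contains a value from df2's ip column, the matching df2 rows should be merged into that df1 row.

Each value in df1's answers column is currently a string joined with the \n newline character, and I'd like the merged columns to be \n-joined as well, e.g. 1.1.1.1\n2.2.2.2, so that when I export to a spreadsheet it looks like what is shown here on V2EX.

I'm not sure whether I've described the requirement clearly. It feels like a rather perverse requirement; I tried merge for a long time and couldn't get it to work. Any help would be much appreciated.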
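One caveat about reading "contains" literally as a substring test: a shorter IP can appear inside a longer one. The snippet below is only a small illustration of that pitfall. The address 11.1.1.10 is made up for the example (it is not in the data above), and `IP_PATTERN` is just an illustrative name.

```python
import re

# Hypothetical answers value: 11.1.1.10 is invented to show the pitfall.
answers = "xxx.xxx.com.\n11.1.1.10 中国深圳电信\n2.2.2.2 中国深圳电信\n"

# Naive substring test: "1.1.1.1" is found inside "11.1.1.10".
naive_hit = "1.1.1.1" in answers
print(naive_hit)  # True, which is a false positive

# Extract whole IPv4-looking tokens first, then compare for equality.
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
ips = set(IP_PATTERN.findall(answers))
print("1.1.1.1" in ips)  # False
print("2.2.2.2" in ips)  # True
```

So it is probably safer to extract the IPs from answers into their own column and join on exact equality, rather than testing containment.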

512357301

2023-06-28 22:30:42 +08:00

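I haven't used pandas, but a horizontal merge is much like a SQL join: rows are matched on one or more key columns.
For your case I think you need a derived column: transform the contents of the answers column into the same format as the ip column. GPT should be able to give you an answer for that step. After that it is an ordinary pandas horizontal merge, which GPT can also handle.
Don't expect to solve both steps in a single prompt. You can also find all of this through Baidu or Google, without relying on GPT.
For the first step the approach would be Python's regex library (re); the second step is pandas merge.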
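The two steps described above could be sketched roughly as follows. This is only a minimal sketch, not a definitive solution: the sample frames are trimmed copies of the data in the question, the IPv4 regex is a loose assumption (it does not validate octet ranges), and the names `IP_PATTERN`, `ip_map`, and `join_lines` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2],
    "answers": [
        "xxx.xxx.com.\nxxx.xxx.xxx.com.\n1.1.1.1 中国深圳电信\n2.2.2.2 中国深圳电信\n",
        "xxx.xxx.com.\nxxx.xxx.xxx.com.\n3.3.3.3 中国上海电信\n4.4.4.4 中国上海电信\n",
    ],
})
df2 = pd.DataFrame({
    "ip": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
    "Age": [13, 0, 20, 25],
    "status_code": [200, 403, 200, 200],
})

# Assumed pattern: four dot-separated groups of 1-3 digits.
IP_PATTERN = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

# Step 1: build the "derived column", one row per (id, ip) pair.
ip_map = (
    df1.assign(ip=df1["answers"].str.findall(IP_PATTERN))
       .explode("ip")[["id", "ip"]]
)

# Step 2: an ordinary join on ip, just like a SQL join.
joined = ip_map.merge(df2, on="ip", how="inner")


def join_lines(values):
    """Join the values of one group into a single newline-separated string."""
    return "\n".join(map(str, values))


# Collapse back to one row per df1 row, then attach the original columns.
collapsed = joined.groupby("id", sort=False).agg(join_lines).reset_index()
result = df1.merge(collapsed, on="id", how="left")
print(result[["ip", "Age", "status_code"]])
```

Splitting the work this way keeps the join itself trivial; all the awkwardness lives in the extraction step and in the final newline-join.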

Rommy

2023-06-29 00:25:02 +08:00

```python
import re
import pandas as pd

# Show full frames when printing.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 100)

df1 = pd.DataFrame({
    'id': [1, 2],
    'regions': ['广东', '上海'],
    'isp': ['电信', '电信'],
    'answers': [
        'xxx.xxx.com.\nxxx.xxx.xxx.com.\n1.1.1.1 中国深圳电信\n2.2.2.2 中国深圳电信\n',
        'xxx.xxx.com.\nxxx.xxx.xxx.com.\n3.3.3.3 中国上海电信\n4.4.4.4 中国上海电信\n',
    ],
})

df2 = pd.DataFrame({
    'Age': [13, 0, 20, 25],
    'ip': ['1.1.1.1', '2.2.2.2', '3.3.3.3', '4.4.4.4'],
    'status_code': [200, 403, 200, 200],
})
# Convert every df2 column to str so the values can be joined with '\n' later.
for column in df2.columns:
    df2[column] = df2[column].apply(str)


def ip_extract(input_string):
    """Return all IPv4-looking substrings, joined with newlines."""
    ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    ip_addresses = re.findall(ip_pattern, input_string)
    return '\n'.join(ip_addresses)


df1_new = df1.copy()
df1_new['ip'] = df1_new['answers'].apply(ip_extract)

# One row per IP, still indexed by the original df1 row label.
df_ip = (
    df1_new['ip'].str.split('\n', expand=True)
    .stack()
    .reset_index(level=1, drop=True)
    .to_frame(name='ip')
)

# Attach each IP to its df1 row, then join df2 on the IP.
df_merge = df1_new.drop(['ip'], axis=1).join(df_ip).merge(df2, on=['ip'])


def concat_func(x):
    """Collapse one group into a single row of newline-joined strings."""
    return pd.Series({column: '\n'.join(x[column]) for column in df2.columns})


df_group = df_merge.groupby(['answers']).apply(concat_func).reset_index()

df = df1.merge(df_group, on=['answers'])
print(df)
```

Rommy

2023-06-29 20:07:34 +08:00

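@cy1027 Having to second-guess yourself just to ask a question, and agonizing over whether it's elegant: is that the attitude learning calls for? Besides, I simply hadn't used pandas in a long time, found the problem interesting, had a go at it, and shared what I wrote. If you're saying my code is poor or too rough, I accept that, but I don't think there is anything to criticize about what I did. Finally, GPT is built on the text of people talking to one another, and behind that are countless very basic questions and answers, so what is there to be arrogant about?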

wxf666

2023-06-30 00:00:15 +08:00

I tried writing a version that's a bit easier to follow:

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2],
    'isp': ['电信', '电信'],
    'regions': ['广东', '上海'],
    'answers': [
        'xxx.xxx.com.\nxxx.xxx.xxx.com.\n1.1.1.1 中国深圳电信\n2.2.2.2 中国深圳电信\n',
        'xxx.xxx.com.\nxxx.xxx.xxx.com.\n3.3.3.3 中国上海电信\n4.4.4.4 中国上海电信\n',
    ],
})

df2 = pd.DataFrame({
    'Age': [13, 0, 20, 25],
    'ip': [
        '1.1.1.1',
        '2.2.2.2',
        '3.3.3.3',
        '4.4.4.4',
    ],
    'status_code': [200, 403, 200, 200],
})

# Take the first whitespace-free token of every line in answers.
# This also captures the hostnames, but those never match an ip in df2,
# so the left merge below ignores them.
df_ip = (
    df1
    .set_index('id')['answers'].str
    .extractall(r'^(?P<ip>[^\s]+)', flags=re.M)
    .reset_index(level='id')
    .set_index('ip')
)

df_result = (
    df2
    # Look up the df1 id for each ip.
    .merge(df_ip, how='left', on='ip')
    # Collapse the rows of each id into newline-joined strings.
    .groupby('id')
    .agg({
        'ip': '\n'.join,
        'Age': lambda s: '\n'.join(s.astype('string')),
        'status_code': lambda s: '\n'.join(s.astype('string')),
    })
    # Bring back the original answers column.
    .merge(df1, how='left', on='id')[[
        'answers',
        'ip',
        'Age',
        'status_code',
    ]]
)
```
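One thing that may be worth checking with an approach like this: a df2 row whose ip does not appear in any answers value gets a missing id after the left merge, and `groupby` drops missing keys by default, so that row would disappear silently. A hedged sketch of how such rows could be listed with the `indicator` option of `merge` follows. The frames here are tiny stand-ins, `ip_map` is an illustrative name, and 9.9.9.9 is a made-up unmatched address.

```python
import pandas as pd

# Stand-in for the extracted (id, ip) pairs.
ip_map = pd.DataFrame({
    "id": [1, 1, 2],
    "ip": ["1.1.1.1", "2.2.2.2", "3.3.3.3"],
})
# Stand-in for df2; 9.9.9.9 is hypothetical and matches nothing.
df2 = pd.DataFrame({
    "ip": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "9.9.9.9"],
    "status_code": [200, 403, 200, 200],
})

# indicator=True adds a "_merge" column telling where each row came from.
checked = df2.merge(ip_map, on="ip", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "ip"].tolist()
print(unmatched)  # ['9.9.9.9']
```

If the list is non-empty, you can decide whether to report those IPs separately or keep them in the output under a placeholder id.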