Data Science

Subforums

Excel & VBA

39 topics · 108 posts

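In the previous post we introduced the XMLHTTP object and the methods for fetching the HTML string behind a table on a web page. How do we then pull the data we want out of that string? Python and Java have packages and functions for this. This post introduces a simple, basic approach instead: regular expressions (Regex).

3. Regular expressions (Regex):
A regular expression is a tool for matching character patterns in a string; it can be used to search, edit, and process text. Regular expressions are very powerful. For a detailed summary, see this CSDN post: VBA学习笔记六：正则表达式. Below I summarize the objects and methods used here.

3.1 Creating the regex object:
3.1.1 Early binding:
Add the reference library: Tools > References > check 'Microsoft VBScript Regular Expressions 5.5'.
    'Declare the reg object directly and it is ready to use
    Dim reg As New RegExp
3.1.2 Late binding (recommended):
Use CreateObject to create a VBScript.RegExp object; the code then works without adding a reference.
    'Declare first, then create the reg object
    Dim reg As Object
    Set reg = CreateObject("VBScript.RegExp")

3.2 Properties and methods used:
3.2.1 Pattern property
• Description: defines the regular expression pattern to match
• Example: match any word of exactly four word characters (\w covers letters, digits, and the underscore)
    reg.Pattern = "\b\w{4}\b"
3.2.2 Global property
• Description: sets whether to search for all matches. The default is False, which matches only the first occurrence
• Example: match every string that fits the pattern
    reg.Global = True
3.2.3 Execute method
• Description: runs the match, finds all substrings that fit the pattern, and returns a collection object (a MatchCollection)
• Example: find the matches, where s is the string to search
    Set matches = reg.Execute(s)

3.3 Matching rules (copied here from the CSDN article):

3.4 Match result objects:
3.4.1 MatchCollection object:
Meaning: a collection object containing every match the regular expression found in the input text; each match is represented as a Match object.
Property used: Item
○ Purpose: returns the Match object at a given position in the collection, counting from 0
○ Usage: matches.Item(index) or matches(index)
○ Example: output the first match
    Set match = matches(0)
    MsgBox "Found match: " & match.Value
3.4.2 Match object:
Meaning: represents a single match and contains the matched text and related information.
Property used: SubMatches
○ Purpose: returns a SubMatches collection (note that a collection is returned, so it is used the same way as the MatchCollection above) containing all captured submatches; accessing it by index (starting from 0) gives the content of each capture group
○ Prerequisite: the pattern uses parentheses to create capture groups (see 3.3.>>>)
○ Usage: match.SubMatches(n)* accesses the content of capture group n+1, that is, the (n+1)-th pair of parentheses in the expression
*The same thing can also be written as match.SubMatches.Item(n).

4. Regex application example:
4.1 Code and comments
    Sub getrates(s As String)
        Dim reg As Object, m As Object, mchs As Object
        Dim i As Long, j As Long, p As String
        Set reg = CreateObject("vbscript.regexp") 'create the regex with late binding
        '----regex
        p = "<tr>\s*<td.*><time.*>([^<]*)</time>\s*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>.*</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*<td.*>([^<]*)</td>\s*</tr>"
        reg.Pattern = p 'Pattern property, see 3.2.1
        reg.Global = True 'Global property, see 3.2.2
        Set mchs = reg.Execute(s) 'Execute method, see 3.2.3
        i = 6 'output starts at row 6
        For Each m In mchs 'MatchCollection object, see 3.4.1
            For j = 0 To m.SubMatches.Count - 1
                'Use the SubMatches property of the Match object (3.4.2) to write each captured string to a cell
                Cells(i, j + 1) = m.SubMatches.Item(j)
            Next j
            i = i + 1
        Next m
    End Sub

4.2 How the pattern p is written:
I copied out the string for the whole table and, taking the first row as an example (one row is the string from <tr> to </tr>), highlighted in yellow the characters I wanted. Following the capture-group method in the matching rules of 3.3, the characters to output are wrapped in (). Based on the HTML syntax, <tr>.*</tr> and <td>.*</td> are used to recognize the tags and capture the content between them, which gives the pattern p used in 4.1.

5. Summary:
The method has two steps. First, use the XMLHTTP object to return the page content as a string. Second, use the matching and capturing functions of regular expressions to return the required characters from that string.

6. Miscellany:
This is the first technical post the author has written at the invitation of the admin Keke, so thanks for the invite~ The author does have doubts about this forum's traffic, though: by the time part 2 was being written, the part 1 post had only two views, one of which was the author's own QAQ. The author hereby calls on Auntie Keke to push forum content to the official account in real time, and wishes Auntie Keke ever greater success~ Respectfully yours ^.^-!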
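For readers who want to try the capture-group idea outside Excel, here is a minimal sketch in Python's re module rather than VBA. The two-column table in html is made up for illustration (the table in the post has many more columns), and the pattern uses [^>]* inside tags instead of the post's .*, which is a swapped-in choice to keep each match inside a single tag.

```python
import re

# Hypothetical HTML; the real table in the post has many more <td> cells per row.
html = """
<table>
<tr> <td class="d"><time datetime="2024-01-02">2024-01-02</time></td> <td class="r">7.10</td> </tr>
<tr> <td class="d"><time datetime="2024-01-03">2024-01-03</time></td> <td class="r">7.12</td> </tr>
</table>
"""

# Each pair of parentheses is one capture group, like SubMatches(0), SubMatches(1) in VBA.
row_pattern = re.compile(
    r"<tr>\s*<td[^>]*><time[^>]*>([^<]*)</time>\s*</td>\s*"
    r"<td[^>]*>([^<]*)</td>\s*</tr>"
)

# findall scans the whole string (like Global = True) and returns one tuple per row.
rows = row_pattern.findall(html)
for date, rate in rows:
    print(date, rate)
```

Each tuple in rows plays the role of one Match object's SubMatches collection, so writing it to a worksheet row or a CSV line is a direct loop over the tuple.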
R

31 topics · 71 posts

References:
https://socialsciences.mcmaster.ca/jfox/Books/Companion-2E/appendix/Appendix-Cox-Regression.pdf
https://cran.r-project.org/web/packages/survival/survival.pdf
https://stackoverflow.com/questions/43173044/how-to-compute-the-mean-survival-time
Python

16 topics · 34 posts

Previously, when packaging some in-house programs, I always used an input statement to have the user type in the file path:
    file_path = input('Enter the file path: ')
Recently, while looking into GUIs, I found a more user-friendly approach: use the tkinter library to open a window and pick the file directly. The code is as follows (select_file() picks a file and select_directory() picks a folder):
    import tkinter as tk
    from tkinter import filedialog

    def select_file():
        root = tk.Tk()
        root.withdraw()  # hide the main window
        file_path = filedialog.askopenfilename()
        return file_path

    def select_directory():
        root = tk.Tk()
        root.withdraw()  # hide the main window
        dir_path = filedialog.askdirectory()
        return dir_path

    if __name__ == '__main__':
        select_file()
SAS

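9 topics · 9 posts

Attached are some SAS books that I have found useful and recommend to everyone. Note: for study use only. The download links are as follows:
The Little SAS Book 中文版.pdf
The Little SAS Book(4th高清).PDF
深入解析SAS 数据处理、分析优化与商业应用1.pdf
深入解析SAS 数据处理、分析优化与商业应用2.pdf
SAS统计分析与数据挖掘.pdf
SAS编程技术与金融数据处理.pdf
SAS SQL Procedure：User's guide.pdf
SAS SBA JJ all cases combined since 15.12.2015.pdf
SAS Macro Programming Made Easy.pdf
Of these, The Little SAS Book is currently the most widely recommended for beginners, and both the Chinese and English editions can be downloaded. 《深入解析SAS》 is the one I keep on hand as a reference. The SAS Help that ships with SAS is also comprehensive and easy to use, but it is entirely in English, so beginners or those who prefer not to read English may find it hard going. The books above are comprehensive tutorials; the others each have a specific focus and can be read selectively.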
SQL

7 topics · 11 posts

In addition, you can refer to this StackOverflow question to wrap this sqlcmd call into a SQL script.
Stata

0 topics · 0 posts

No new topics
Power BI

1 topic · 1 post

Here are some excellent visualizations made with Power BI that I recommend: https://training.bielite.com/contest-april-2021-youtube-analysis/
JavaScript

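1 topic · 4 posts

We can also take an example from the TypeScript website and run it in Excel. Below is a simple conditional: if counter is less than 100, increment it by one, then print it.
    const max = 100;
    let counter = 0;
    if (counter < max) {
        counter++;
    }
    console.log(counter); // 1
Run it!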
Julia

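1 topic · 4 posts

In the earlier examples we always used a fixed interest rate. What about a yield curve? We can build a yield curve from the spot rates of zero-coupon bonds with different maturities, name it curve, and then use this curve to compute present value and duration.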
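The idea can be sketched in a few lines. The following is written in Python rather than Julia and is only an illustration: the function names, the spot rates in curve, and the sample bond are all assumptions, and rates are taken to be annually compounded with one rate per cashflow date.

```python
from typing import Dict, List, Tuple

Cashflows = List[Tuple[int, float]]  # (time in years, amount)


def present_value(cashflows: Cashflows, curve: Dict[int, float]) -> float:
    """Discount each cashflow at the spot rate for its own maturity."""
    return sum(amount / (1.0 + curve[t]) ** t for t, amount in cashflows)


def macaulay_duration(cashflows: Cashflows, curve: Dict[int, float]) -> float:
    """Present-value-weighted average time of the cashflows."""
    pv = present_value(cashflows, curve)
    weighted = sum(t * amount / (1.0 + curve[t]) ** t for t, amount in cashflows)
    return weighted / pv


# Hypothetical spot rates by maturity (years), as implied by zero-coupon bonds.
curve = {1: 0.020, 2: 0.025, 3: 0.030}
# Hypothetical 3-year bond: 5% annual coupon, face value 100.
bond = [(1, 5.0), (2, 5.0), (3, 105.0)]

pv = present_value(bond, curve)
duration = macaulay_duration(bond, curve)
print(round(pv, 2), round(duration, 3))
```

With a flat curve this reduces to the fixed-rate case, and for a single zero-coupon cashflow the duration equals its maturity, which makes a quick sanity check.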

VBA, Python, and R consulting available
• Mengkelyu

2 posts · 0 votes · 29 views

You are also welcome to add me on WeChat (MengkeLyuActuary); I will add you to my coding group.
SnakeBytes 2/25: Corbin Barrels Slade
• ArizonaDiamond

1 post · 0 votes · 2 views

No replies yet
[Tip] How to quickly batch copy, paste, and rename files (Windows)
• Mengkelyu

1 post · 0 votes · 3 views

No replies yet
[YouTube study notes] 50 useful Linux commands
• Mengkelyu

2 posts · 0 votes · 9 views

cat file  # print all contents of a file
cat file1 file2  # print all contents of file1 and file2
cat -n file1  # -n also prints the line numbers
less <file>  # show the content of a file in a nice, interactive UI; a nicer form of cat
echo  # print the arguments passed to it; the equivalent of print
wc  # print basic counts: number of lines, number of words, number of bytes
|  # pass the output of the command before the symbol as input to the command after it
ls -l | wc  # counts for the output of ls -l: ls -l lists the files in the current folder, and that text is passed to wc
sort  # sort the lines of a file
sort -n file  # sort numerically
uniq  # report or omit repeated lines
uniq -d  # only report duplicated lines
uniq -u  # only report non-duplicated lines
uniq -c  # count how many times each line occurs
*.txt  # match all files ending in .txt
?  # match one character
{a,b,c}.txt  # expands to a.txt b.txt c.txt
Day{1..365}  # expands to Day1 Day2 Day3 ... Day365
diff  # compare the contents of two files
diff -y file1 file2  # compare two files line by line, side by side
find <directory> <criteria>
find . -name "*.txt"
find . -iname "*.Txt"  # -i for case-insensitive
find . -type d  # only search for directories
find . -type f  # only search for files
find . -name "E*" -or -name "*E"
find . -type f -exec cat {} \;  # run cat on every file found; {} is a placeholder for each result
grep  # search inside files
grep <string to be found> <file to search>
grep -r <string to be found> <directory>  # recursively search all files in a folder; -ri makes it case-insensitive
du  # show the size of directories or files
df  # show how much disk space is left
history  # history of all commands run; combined with grep, it can find commands containing certain characters
ps  # view the processes running on the computer
top  # display the most resource-intensive processes, updated live
kill  # terminate a process
kill <PID>
kill -9 <PID>  # forceful kill (SIGKILL)
kill -15 <PID>  # gentle kill (SIGTERM)
killall  # kill all processes with a given name
jobs  # list the jobs started from the current shell
fg <job number>  # bring a job to the foreground
bg <job number>  # resume a job in the background
<command> &  # run the command in the background
jobs  # see those commands
gzip <file name>  # compress a file
gzip -k <file name>  # compress a file while keeping the original
gzip -d <file name>  # decompress a file
gunzip <file name>  # decompress a file
tar  # create an archive, grouping multiple files into a single file
tar -cf archive.tar file1 file2  # c stands for create; f writes the archive to a file
tar -xf <archive>  # extract into the current folder
tar -czf archive.tar.gz file1 file2  # create a tar archive and then run gzip on it
nano  # a friendly editor
alias <new command name>='<old command>'  # give a long command a short alias (no spaces around =)
# Difference between "" and '': "" allows variables inside a string
echo "home directory $PWD"  # here $PWD is treated as a variable
echo 'home directory $PWD'  # here $PWD is not treated as a variable
xargs  # turn the output of the first command into arguments for the second command
find x -size +1M | xargs ls -lh
ln  # create links; a link is effectively a pointer to another file
# Two types of links: hard links and soft links
ln <file1> <file2>  # create a hard link file2 to file1
ln -s <file1> <file2>  # create a soft link
# When the original file is removed, a soft link is left dangling, while a hard link still gives access to the content
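The difference between the two link types is easy to check in a scratch directory. Below is a small Python sketch, assuming a POSIX system where both link types are allowed; os.link and os.symlink correspond to ln and ln -s, and the file names are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.txt")
    hard = os.path.join(tmp, "hard.txt")
    soft = os.path.join(tmp, "soft.txt")

    with open(original, "w") as f:
        f.write("hello")

    os.link(original, hard)     # like: ln original.txt hard.txt
    os.symlink(original, soft)  # like: ln -s original.txt soft.txt

    os.remove(original)         # delete the original name

    # The hard link is another name for the same data, so it still reads fine.
    with open(hard) as f:
        hard_content = f.read()

    # The soft link entry is still there, but its target no longer exists.
    soft_entry_exists = os.path.lexists(soft)
    soft_target_exists = os.path.exists(soft)

    print(hard_content, soft_entry_exists, soft_target_exists)
```

Here os.path.lexists looks at the link itself while os.path.exists follows it, which is why the two checks can disagree for a dangling soft link.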
[Tip] How to quickly set up folders (Windows)
• Mengkelyu

1 post · 0 votes · 11 views

No replies yet
A collection of free, open-source programming books
• Mengkelyu

1 post · 0 votes · 6 views

No replies yet
Classic textbook: Introduction to Data Mining_2ed_Pang Ning Tan
• Howie Jie

1 post · 0 votes · 3 views

No replies yet
A collection of software used by actuaries
• Mengkelyu

1 post · 0 votes · 13 views

No replies yet
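One-line EDA: an introduction to and comparison of Python and R packages that do exploratory data analysis in one line of code
• Mengkelyu

9 posts · 0 votes · 25 views

AutoViz
As usual, install it first:
    python -m venv au-ven
    au-ven\Scripts\activate.ps1
    pip install autoviz
Then it raised errors. The first one was ModuleNotFoundError: No module named 'scipy.special.cython_special'. After searching online, it turned out that the scipy version was incompatible, so I downgraded scipy to 1.4.1:
    python -m pip install scipy==1.4.1
After that I found that this package has to be run in Jupyter, because it cannot generate an HTML report the way Sweetviz and similar packages do.
    from autoviz.AutoViz_Class import AutoViz_Class
    # Dataset source: https://github.com/hmix13/AutoViz
    AV = AutoViz_Class()
    df = AV.AutoViz('car_design.csv')
Jupyter Lab -> There you go! Honestly, I did not expect this package to impress me this much; its charts look really good.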
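Lessons from scraping requests on dynamic web pages
• Mengkelyu

2 posts · 0 votes · 10 views

For various reasons I cannot disclose which two websites these were, but I can share some takeaways. For the first site, the key breakthrough was the browser's network monitor (in Chrome, press F12 and select Network): press Ctrl+F, type a keyword from the information you want to scrape (mine was "平安"), and the URL the site uses to fetch that information turns up. The second site was more interesting. I found that the page first fetches all of the data and that pagination is done entirely in the front end, with the data variable still held in the browser. After locating the data variable "List", I ran the following JS code in the browser console; it defines a function for saving a browser variable to a file.
    (function (console) {
        console.save = function (data, filename) {
            if (!data) {
                console.error('Console.save: No data');
                return;
            }
            if (!filename) filename = 'console.json';
            if (typeof data === "object") {
                data = JSON.stringify(data, undefined, 4);
            }
            var blob = new Blob([data], {type: 'text/json'}),
                e = document.createEvent('MouseEvents'),
                a = document.createElement('a');
            a.download = filename;
            a.href = window.URL.createObjectURL(blob);
            a.dataset.downloadurl = ['text/json', a.download, a.href].join(':');
            e.initMouseEvent('click', true, false, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
            a.dispatchEvent(e);
        };
    })(console);
Then run
    console.save(List, "List.txt")
and the variable List is saved to the file List.txt. The whole process is essentially scraping without writing a scraper.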
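Once List.txt has been downloaded, the JSON inside it can be turned into a table. The sketch below is one possible follow-up in Python, assuming the saved variable is a JSON array of flat objects; the helper name json_list_to_csv and the sample records are made up for the demo.

```python
import csv
import json
import os
import tempfile


def json_list_to_csv(json_path: str, csv_path: str) -> int:
    """Convert a JSON array of flat objects to CSV and return the row count."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    # Union of keys in first-seen order, so records with missing keys still fit.
    fieldnames = list(dict.fromkeys(key for rec in records for key in rec))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells
    return len(records)


# Demo with made-up records standing in for the browser's List variable.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "List.txt")
    dst = os.path.join(tmp, "List.csv")
    sample = [{"name": "A", "price": 1.5}, {"name": "B", "price": 2.0, "note": "x"}]
    with open(src, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4)

    n_rows = json_list_to_csv(src, dst)
    with open(dst, encoding="utf-8", newline="") as f:
        csv_rows = list(csv.reader(f))

print(n_rows, csv_rows[0])
```

If the real records contain nested objects, they would need flattening first; that step is left out here because the structure of the original data is not known.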
How to make a Grouped and Stacked Column Chart with Bokeh and ggplot
• Mengkelyu

4 posts · 0 votes · 10 views

The Bokeh result:
Three ensemble methods (bagging, boosting, stacking)
• Mengkelyu

10 posts · 0 votes · 24 views

In my own tests, the result after bagging was indeed noticeably better than a single run, improving AUC by roughly 0.01.
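To make the mechanism concrete, here is a minimal pure-Python sketch of bagging. It is not the model or data from the tests above and does not reproduce the AUC figure: the base learner (a one-feature threshold stump), the toy data, and all function names are assumptions chosen to keep the example short.

```python
import random
from statistics import mean


def fit_stump(points):
    """Pick the threshold with the fewest errors for the rule: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(points, n_models=25, seed=0):
    """Fit one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        thresholds.append(fit_stump(sample))
    return thresholds


def predict_proba(thresholds, x):
    """Average the 0/1 votes of all stumps to get a score between 0 and 1."""
    return mean(1.0 if x > t else 0.0 for t in thresholds)


# Toy data: larger x tends to mean label 1, with two noisy points in the middle.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1), (0.5, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

models = bagged_stumps(data)
print(predict_proba(models, 0.05), predict_proba(models, 0.45), predict_proba(models, 0.95))
```

Averaging many stumps fitted on resampled data smooths out the hard 0/1 jump of a single stump near the noisy middle region, which is the variance-reduction effect bagging relies on.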
Methods for handling missing values
• Mengkelyu

1 post · 0 votes · 17 views

No replies yet
