"Modern Information Retrieval" notes
On search user interfaces, Marchionini distinguishes information lookup from exploratory search. Information lookup is like looking something up in a database: entering the simplest of queries is enough to complete the task. Exploratory search divides into learning and investigating. A learning search needs several query/response pairs, and the user must spend time reading multiple retrieved items and synthesizing their content. Investigating is an even longer-term process, iterated many times over an extended period, with the returned results being evaluated along the way.
Current models emphasize the dynamic nature of the search process: users learn while they search, and their information need adjusts as they see retrieval results. This dynamic process is called the berry-picking model.
Sometimes users break a complex query that is hard to answer directly into several simple, targeted queries; this strategy is called orienteering.
Information foraging theory (IFT) draws on ideas from evolutionary theory.
Search engines use navigation structures: a given interface may need several clicks to lead searchers to their goal.
Search interfaces have also used deep links and site links, and still do.
After a query has produced some results, more than 50% of users revise the query at least once. Search interfaces increasingly use related-term suggestion techniques, usually called query term expansion (term ex ...
4.10 Kaggle competition: predicting house prices
4.10.1 download and load datasets

A fairly complete set of file download/save operations.

```python
#We will download different datasets, and we will write some functions to download them easily.
#First we set up a dictionary DATA_HUB, which maps dataset names to dataset-related 2-tuples.
#Each 2-tuple holds the dataset's URL and a SHA-1 key used to verify file integrity.
#All such datasets are stored on a site whose address is DATA_URL.
import hashlib
import os
import tarfile
import zipfile
```
```python
import requests
#@save ...
```
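The SHA-1 integrity check described above can be sketched as follows; `check_sha1` is a hypothetical helper (not the book's exact `download` function), shown only to illustrate how the hash stored in DATA_HUB would be verified against a downloaded file:

```python
import hashlib

def check_sha1(path, expected_sha1, chunk_size=1 << 20):
    """Return True if the file at `path` hashes to `expected_sha1` (hex)."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)  # hash in chunks to handle large files
            if not chunk:
                break
            sha1.update(chunk)
    return sha1.hexdigest() == expected_sha1
```

If the check fails, the file is re-downloaded; if it passes, a cached copy can be reused.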
4.9 environment and distribution shift types
```python
#1. Covariate shift: for example, our training set is a collection of real cat and dog photos, but the test set
#   contains cartoon cats and dogs. That is, we train on a dataset whose features differ in an essential way from
#   the test set's features; without some way to adapt to the new domain, we may run into trouble.
#2. Label shift describes the problem opposite to covariate shift. Here we assume the label marginal P(y) can change,
#   while the class-conditional distribution P(x|y) stays the same across domains.
#   Label shift is a reasonable assumption when we believe y causes x. For example, in diagnosing a disease we judge
#   from the symptoms, even though the relative prevalence of the disease changes over time: the disease is the label
#   and its frequency shifts, but the symptoms it produces do not change.
#3. Concept shift: the definition of the label itself changes, e.g. the diagnostic criteria for mental illness, or job titles.
```
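As a minimal sketch of the reweighting idea behind label-shift correction (the helper name and the assumption that the test label marginal is known are mine, not from the notes):

```python
from collections import Counter

def label_shift_weights(train_labels, test_label_dist):
    """Per-class importance weights w(y) = p_test(y) / p_train(y).

    `test_label_dist` maps each label to its (estimated) marginal
    probability under the test distribution.
    """
    n = len(train_labels)
    train_dist = {y: c / n for y, c in Counter(train_labels).items()}
    return {y: test_label_dist[y] / train_dist[y] for y in train_dist}
```

Weighting each training example's loss by w(y_i) corrects for label shift, precisely because P(x|y) is assumed unchanged between the two domains.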
4.9.2 distribution shift examples

```python
#Take self-driving cars: we want to develop autonomous driving with machine learning.
#When we need to learn road edges and build a curb detector, someone grabs curb data from a video game as extra
#training data, and it even works very well at test time, because the model quickly learns that rendering artifact.
#In a real application, though, this is a disaster.
```
4.9.3 distribution shift c ...
4.7 forward propagation, backward propagation and computational graphs
```python
#Forward propagation means computing from the input layer to the output layer and storing the results at each layer.
#We all know backward propagation computes the derivatives of each W (W1, W2); forward propagation is just the
#forward computation. Note: it involves no derivatives, just XW, then ReLU(XW), then on to the next layer to get the output.
```
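The forward pass just described, for one hidden layer with ReLU, can be sketched as follows (a minimal numpy sketch; the name `forward` and the cached-tuple return are assumptions, not d2l's API):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """One hidden layer: compute and cache intermediates for backprop."""
    Z1 = X @ W1 + b1          # pre-activation of the hidden layer
    H = np.maximum(Z1, 0)     # ReLU
    O = H @ W2 + b2           # output layer (no activation here)
    return O, (X, Z1, H)      # cached values reused by the backward pass
```

The cache is the point: backpropagation re-uses X, Z1 and H when computing the gradients of W1 and W2, which is why forward propagation stores each layer's result.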
4.8 numerical stability and model initialization

```python
#Previously we assigned a distribution to initialize parameters like W and b, but initializing sche ...
```
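The section is cut off here, but one standard initialization scheme this chapter leads into is Xavier/Glorot initialization, which keeps the variance of activations roughly constant across layers. A hedged sketch of the uniform variant:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Glorot/Xavier uniform init: Var(W) = 2 / (fan_in + fan_out)."""
    rng = rng or np.random.default_rng(0)
    # U(-a, a) has variance a^2 / 3, so a = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```

Keeping the per-layer variance stable is what prevents activations and gradients from exploding or vanishing as depth grows.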
4.6 dropout
```python
#A regularization technique called "dropout".
#It prevents overfitting on the dataset: dropout drops some neural units at random during every iteration,
#so no neuron in the network can rely on specific other neurons.
#It is like training a collection of subnetworks, which reduces overfitting and improves robustness.
```
```python
#We add noise to the inputs of each layer, drawn from normal distributions.
```
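A minimal sketch of a dropout layer consistent with the description above (the signature is an assumption; d2l's own `dropout_layer` may differ):

```python
import numpy as np

def dropout_layer(X, p, rng=None):
    """Zero each element with probability p; scale survivors by 1/(1-p).

    The rescaling keeps E[output] == input, so nothing needs to change
    at test time, where dropout is simply switched off.
    """
    assert 0.0 <= p <= 1.0
    if p == 1.0:
        return np.zeros_like(X)
    if p == 0.0:
        return X
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(X.shape) > p).astype(X.dtype)
    return mask * X / (1.0 - p)
```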
4.6.4 implementation from scratch

```python
# ...
```
4.5 weight decay
```python
#To reduce overfitting (rote memorization does not achieve learning), we add a penalty term so the model cannot
#simply memorize. The original training objective, minimizing the prediction loss on the training labels, is
#changed to minimizing the sum of the prediction loss and the penalty term.
#We take the L2 norm as the penalty term. If the weight vector grows large, the learning algorithm will concentrate
#more on minimizing the weight norm ||w||^2.
#More details to be filled in later.
```
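The adjusted objective can be sketched as follows (a minimal sketch; the function name and the lambda/2 convention are assumptions):

```python
import numpy as np

def l2_penalized_loss(y_hat, y, w, lambd):
    """Squared loss plus the L2 penalty (lambda/2) * ||w||^2."""
    mse = 0.5 * np.mean((y_hat - y) ** 2)
    return mse + (lambd / 2.0) * np.sum(w ** 2)
```

Gradient descent on this objective shrinks every weight by a factor proportional to lambda at each step, which is why the technique is called weight decay.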
4.5.2 high-dimensional linear regression

```python
%matplotlib inline
import torch
from torch import nn
from d2l import torch as d2l
```
```python
#First we generate some data as before, with ...
```
KNN algorithm
Algorithm principle: the KNN algorithm (k-nearest-neighbours algorithm) can be used to classify samples and also to predict where a sample is heading. Here we cover classification first.
Consider the following figure (borrowed):
Informally, the green point in the figure is the unknown one; the red and blue points carry the two labels and are already classified, and we must decide whether the green point belongs to the red class or the blue class. So we choose the parameter K = 3, find the three points closest to the green point, then compute the proportions of red and blue points among them; whichever proportion is larger gives the class.
The choice of K matters: if K is very small, overfitting occurs easily; if K is very large then, in a regression problem where the samples follow, say, a quadratic trend, pulling many distant samples into consideration distorts the fitted trend, i.e. underfitting.
So our algorithm proceeds as follows:
1. Compute the distance between the point to classify and each point with a known class.
2. Sort all the points by increasing distance, choose the parameter K, and take the K points closest to the query point.
3. Among those K points, compute the proportion of each class; the class with the largest proportion is the result.
Pseudocode

```python
#We assume dataset is a 2-D array; each row has two components: the first denotes the class, e.g. red,
#and the second the value.
int a  #the point to classify
dataset.reshape(-1, 2)  #i.e. reshape it into column form
for i in ran ...
```
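The steps above, fleshed out as a small runnable sketch (pure Python with Euclidean distance; the `(features, label)` data layout and the function name are my assumptions, not the pseudocode's):

```python
import math
from collections import Counter

def knn_classify(point, dataset, k=3):
    """dataset: list of (features, label) pairs.

    Returns the majority label among the k nearest neighbours
    of `point` by Euclidean distance.
    """
    # sort all labelled points by distance to the query, keep the k closest
    nearest = sorted(dataset, key=lambda item: math.dist(item[0], point))[:k]
    # count each class among the k neighbours; largest share wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```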
cryptography
The key, .cer, and similar key files we deal with day to day are not, when opened, in the n, p, q public/private-key form we would like, but in an already-encoded form. The encoding rules are based on the ASN.1 encoding formats of the X.690 standard: Basic Encoding Rules (BER), Canonical Encoding Rules (CER), and Distinguished Encoding Rules (DER).
BER was the original set of rules laid out by the ASN.1 standard for encoding data into a binary format. The rules use octets (8-bit bytes) to encode data.
X.680 defines a syntax for declaring data types, for example: booleans, numbers, strings and compound structures. Each type definition also includes an identifying number.
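For instance, DER encodes an ASN.1 INTEGER as the tag 0x02, a length octet, then the value in minimal big-endian two's complement. A sketch covering only non-negative values and short-form lengths (the helper name is mine):

```python
def der_encode_uint(n):
    """DER-encode a non-negative INTEGER: tag 0x02, short-form length,
    minimal big-endian two's-complement content."""
    assert n >= 0
    body = n.to_bytes((n.bit_length() + 7) // 8 or 1, "big")
    if body[0] & 0x80:       # high bit set would read as negative:
        body = b"\x00" + body  # prepend a zero octet
    assert len(body) < 128   # short-form length only, for simplicity
    return bytes([0x02, len(body)]) + body
```

So 5 encodes as `02 01 05`, while 128 needs a leading zero octet (`02 02 00 80`) to stay non-negative; real keys are nested SEQUENCEs of such INTEGERs.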
X.68 ...
4.4 model selection, underfitting and overfitting
4.4.1 polynomial regression

```python
import math
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l
```
4.4.4 polynomial regression

1. generate datasets

```python
#Background to fill in:
#Generalization error: we estimate it by applying the model to an independent, randomly drawn test set that never
#appeared among the training samples, to avoid overfitting.
#So the generalization error is the expected error of the model over infinitely many data samples drawn from the
#same distribution as the original samples.
#Training error: the error the model attains on the training dataset.
```
```python
#Given x, we use the following third-order polynomial to generate the labels of the training and test data:
#y = 5 + 1.2x - 3.4x^2/2! + 5.6x^3/3! + epsilon, where epsilon follows the normal distribution N(0, 0.1^2)
max_degree = 20  #maximum degree of the polynomial
n_train, n_test = 100, 100
true_w = np.zeros(max_degree) ...
```
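The labeling function can be sketched as follows (a minimal sketch; `poly_label` is my name, not the book's code):

```python
import math
import numpy as np

def poly_label(x, noise_std=0.1, rng=None):
    """y = 5 + 1.2 x - 3.4 x^2/2! + 5.6 x^3/3! + eps, eps ~ N(0, noise_std^2)."""
    rng = rng or np.random.default_rng(0)
    y = (5 + 1.2 * x
         - 3.4 * x ** 2 / math.factorial(2)
         + 5.6 * x ** 3 / math.factorial(3))
    return y + rng.normal(0, noise_std, size=np.shape(x))
```

The factorial denominators keep the higher-degree terms from dominating, so the fitted coefficients stay on a similar scale across degrees.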
4.1 multilayer perceptron
Basics skipped for now, to be filled in later.

```python
#Activation functions compute the weighted sum and add a bias to determine whether a neuron should be activated.
#They transform input signals into outputs through differentiable operations.
#Most activation functions are nonlinear.
%matplotlib inline
import torch
from d2l import torch as d2l
```
1. ReLU function

```python
#rectified linear unit: ReLU(x) = max(x, 0), i.e. discard all the negative elements.
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.relu(x)
d2l.plot(x.detach(), y.detach() ...
```



