Python Basics: Collected Notes



  • Python basics I put together a while ago. The formatting is still rough in many places, so please bear with me.

    change the working directory of anaconda

    In the terminal, run

    jupyter notebook --generate-config
    

    Modify the config file and restart Anaconda Navigator:

    Open the jupyter_notebook_config.py file in any suitable text editor and modify the “c.NotebookApp.notebook_dir” entry to point to the desired working directory. In a Windows file path you will have to change each “\” to “\\” (or to “/”), because a single backslash starts an escape sequence in a Python string. Make sure to uncomment the line by removing the “#”.

    Save the file and restart the Anaconda Navigator.
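
    As a rough illustration of the backslash point above, here are three equivalent ways to write a Windows path in Python source. The folder C:\Users\me\notebooks is a made-up example, not a path from these notes.

```python
# Three ways to write the same (hypothetical) Windows path in Python source.
escaped = "C:\\Users\\me\\notebooks"   # every backslash doubled
raw = r"C:\Users\me\notebooks"         # raw string, no escaping needed
forward = "C:/Users/me/notebooks"      # forward slashes are accepted on Windows too

print(escaped == raw)  # True

# In jupyter_notebook_config.py the uncommented entry would then look like:
# c.NotebookApp.notebook_dir = r"C:\Users\me\notebooks"
```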

    get current working directory

    import os
    os.getcwd()
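
    A small sketch of reading and changing the working directory from inside Python. The temporary directory is only there to have somewhere safe to switch to; the variable names are illustrative.

```python
import os
import tempfile

start_dir = os.getcwd()              # directory the interpreter was started in

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                    # switch to a throwaway directory
    inside = os.getcwd()             # now reports the temporary directory
    os.chdir(start_dir)              # switch back before tmp is deleted

print(start_dir)
print(inside)
```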
    

    enumerate

    • loop through the items
    lst = ["app", "banana", "gig"]
    for thing in lst:
        print(thing)
    
    • index + item: use enumerate
    lst = ["app", "banana", "gig"]
    for idx, thing in enumerate(lst):
        print(idx)
        print(thing)
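
    enumerate also accepts a start argument, which is handy when the numbering should begin at 1 instead of 0. A minimal sketch using the same list:

```python
lst = ["app", "banana", "gig"]

# start=1 only changes the counter, not which items are visited
numbered = [f"{idx}. {thing}" for idx, thing in enumerate(lst, start=1)]
print(numbered)  # ['1. app', '2. banana', '3. gig']
```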
    

    How to find a subset of a list

    # lst is the source list, x the threshold (avoid naming a variable "list", it shadows the built-in)
    sublist = [i for i in lst if i > x]
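
    A runnable version of the same idea, with made-up numbers and a made-up threshold:

```python
numbers = [4, 9, 1, 12, 7]
threshold = 5

# keep only the items that satisfy the condition, in their original order
sublist = [n for n in numbers if n > threshold]
print(sublist)  # [9, 12, 7]
```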
    

    The summation of list

    Similar to union_all

    a = [1,2,3]
    b = [3,3,4]
    a+b
    # [1, 2, 3, 3, 3, 4]
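
    To make the "union all" comparison concrete, this sketch contrasts list concatenation, which keeps duplicates, with a set union, which removes them:

```python
a = [1, 2, 3]
b = [3, 3, 4]

combined = a + b                       # keeps duplicates and order
deduplicated = sorted(set(a) | set(b)) # set union drops duplicates

print(combined)      # [1, 2, 3, 3, 3, 4]
print(deduplicated)  # [1, 2, 3, 4]
```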
    

    Differences of loc and iloc and []

    • df['col_name'].values versus df[['col_name']].values: the former gives a 1-D array and the latter gives a 2-D array with a single column.

    • [] and loc[] return the same result in many common cases (for example with a boolean mask), but they are not equivalent: df['a'] selects a column, while df.loc['a'] selects a row. It is clearer to call loc explicitly.

    • Avoid chained indexing such as Ax['s']['as']. It can be replaced by Ax.loc['as', 's'].

    • The way to index on column name and row number without chain indexing

    df.loc[df.index[0], 'NAME']
    # or
    df.iloc[0, df.columns.get_loc("a")]
    
    • loc is label-based, which means that we have to specify the names (labels) of the rows and columns that we want to select.
      • For example, let’s say we search for the rows whose index is 1, 2 or 100. We will not get the first, second or the hundredth row here. Instead, we will get the results only if the name of any index is 1, 2 or 100.
    # select all rows with a condition
    data.loc[data.age >= 15]
    # select with multiple conditions
    data.loc[(data.age >= 12) & (data.gender == 'M')]
    # Select a range of rows using loc (slice)
    data.loc[1:3]
    # loc slices by label and includes both endpoints. If the index is not sorted, the slice runs from the row labelled 1 to the row labelled 3 in their stored order, so rows with other labels in between are returned as well
    # Select only required columns with a condition
    data.loc[(data.age >= 12), ['city', 'gender']]
    # update a column with condition
    data.loc[(data.age >= 12), ['section']] = 'M'
    # update multiple columns with condition
    data.loc[(data.age >= 20), ['section', 'city']] = ['S','Pune']
    
    # select a column (data.loc[['col_name']] would look for a ROW labelled 'col_name')
    data.loc[:, 'col_name']
    
    # select index + column
    data.loc[data.age >= 12,'col_name']
    
    • On the other hand, iloc is integer index-based. So here, we have to specify rows and columns by their integer index.
    # select rows with indexes
    data.iloc[[0,2]]
    # select rows with particular indexes and particular columns
    data.iloc[[0,2],[1,3]]
    # select a range of rows
    data.iloc[1:3]
    # select a range of rows and columns
    data.iloc[1:3,2:4]
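
    The label versus position distinction is easiest to see on a frame whose integer labels do not match the row positions. The toy data below is invented for illustration.

```python
import pandas as pd

# Integer labels deliberately out of positional order
data = pd.DataFrame({"age": [10, 15, 20]}, index=[2, 100, 1])

by_label = data.loc[1, "age"]      # row LABELLED 1, which is the third row
by_position = data.iloc[1, 0]      # second row, first column

label_slice = data.loc[2:100]      # label slice: both endpoints included
position_slice = data.iloc[0:1]    # position slice: end excluded

print(by_label, by_position)                   # 20 15
print(len(label_slice), len(position_slice))   # 2 1
```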
    

    How to slice series

    # df_temp is a pandas Series: the number of missing values in each column
    df_temp = df_all.isnull().sum(axis=0)
    df_temp[df_temp>0]
    
    • Select a particular column
    df['label']
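
    A self-contained sketch of the boolean-mask slice above. The small frame stands in for df_all and is invented for the example.

```python
import pandas as pd

# Hypothetical stand-in for df_all
df_all = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [1, 2, 3],
    "c": [None, None, 3.0],
})

missing_per_column = df_all.isnull().sum(axis=0)               # Series indexed by column name
only_missing = missing_per_column[missing_per_column > 0]      # keep entries where the mask is True

print(only_missing.to_dict())  # {'a': 1, 'c': 2}
```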
    

    Basic plotting

    # packages
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    df_train['label'].value_counts().plot(kind='bar')
    # draw each image in its own subplot
    fig = plt.figure(figsize=(18, 10))
    
    # images is assumed to hold up to six row positions of df_train (2 x 3 grid)
    for idx, row in enumerate(images):
        ax = fig.add_subplot(2, 3, idx + 1)
        ax.set_xticks([])
        ax.set_yticks([])
        # 28 * 28 = 784 pixel columns are assumed to follow the label column
        pixels = df_train.iloc[row, 1:785].values.reshape((28, 28))
        ax.imshow(pixels, cmap="gray")
        ax.set_title(df_train.iloc[row]['label'], fontsize=24)
    

    Differences between Index, array and list

    Array / List

    Lists and NumPy arrays are both used in Python to store data (strings, integers and so on), and both can be indexed and iterated over. They differ in the operations they support: arithmetic on an array is applied element by element, so dividing an array by 4 returns a new array, whereas dividing a list by 4 raises a TypeError.
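
    The division example can be tried directly. This is a minimal sketch with invented values:

```python
import numpy as np

values = [4, 8, 12]
arr = np.array(values)

quarter = arr / 4          # element-wise division returns a new array
print(quarter.tolist())    # [1.0, 2.0, 3.0]

try:
    values / 4             # a plain list does not define division
    list_division_failed = False
except TypeError:
    list_division_failed = True

print(list_division_failed)  # True
```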

    index

    A pandas Index, on the other hand, is immutable
    Index: Immutable ndarray implementing an ordered, sliceable set.

    Properties

    Index.values: Return an array representing the data in the Index.
    Index.is_monotonic: Alias for is_monotonic_increasing (deprecated, and removed in pandas 2.0).
    Index.is_monotonic_increasing: Return whether the values are monotonically increasing (each value equal to or greater than the previous one).
    Index.is_monotonic_decreasing: Return whether the values are monotonically decreasing (each value equal to or smaller than the previous one).
    Index.is_unique: Return whether the index has unique values.
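
    A short sketch of these properties, and of what immutability means in practice, on an invented Index:

```python
import pandas as pd

idx = pd.Index([1, 2, 2, 5])

print(idx.values)                   # the underlying NumPy array
print(idx.is_monotonic_increasing)  # True: values never go down
print(idx.is_monotonic_decreasing)  # False
print(idx.is_unique)                # False: 2 appears twice

try:
    idx[0] = 10                     # item assignment is not allowed on an Index
    assignment_failed = False
except TypeError:
    assignment_failed = True
```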

    dataframe.sum

    dataframe.sum(axis=0)
    Sums each column down its rows, giving one total per column
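
    The axis argument is easy to mix up, so here is a tiny sketch on invented data showing both directions:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

column_totals = df.sum(axis=0)   # collapse the rows: one total per column
row_totals = df.sum(axis=1)      # collapse the columns: one total per row

print(column_totals.to_dict())   # {'a': 3, 'b': 30}
print(row_totals.tolist())       # [11, 22]
```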

    drop / dropna / isna /fillna

    • index of missing values in a particular column
    idx_missing = df[column].isna()
    
    • find the rows without missing values
    # negate a boolean mask with ~ ; unary minus on a boolean Series may raise a TypeError
    df[~idx_missing]
    df.loc[~idx_missing]
    
    • fill na with value
    # fill missing values with "No College"
    # assign the result back: fillna(..., inplace=True) on a selected column
    # may not update the original dataframe in recent pandas versions
    nba["College"] = nba["College"].fillna("No College")
    # fill na with the mean (numeric columns only; "Salary" is assumed here
    # to be a numeric column of the same dataframe)
    nba["Salary"] = nba["Salary"].fillna(nba["Salary"].mean())
    # ffill fills a gap with the previous valid value, bfill with the next one
    # (fillna(method="ffill") is deprecated in recent pandas in favour of ffill())
    nba["College"] = nba["College"].ffill()
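
    The fill strategies can be compared side by side on a small Series. The values are invented; none of the calls changes the original Series.

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

with_constant = s.fillna(0.0)     # replace the gap with a fixed value
with_mean = s.fillna(s.mean())    # mean() skips missing values, so it is 2.0
forward = s.ffill()               # copy the previous valid value forward
backward = s.bfill()              # copy the next valid value backward

print(with_constant.tolist())  # [1.0, 0.0, 3.0]
print(with_mean.tolist())      # [1.0, 2.0, 3.0]
print(forward.tolist())        # [1.0, 1.0, 3.0]
print(backward.tolist())       # [1.0, 3.0, 3.0]
```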
    
    

    Matrix operation

    The NumPy module provides functions for the common matrix operations:

    • add() − add elements of two matrices.

    • subtract() − subtract elements of two matrices.

    • divide() − divide elements of two matrices.

    • multiply() − multiply elements of two matrices.

    • dot() − performs matrix multiplication, not element-wise multiplication.

    • sqrt() − square root of each element of matrix.

    • sum(x,axis) − adds up all the elements of the matrix. The second argument is optional: axis=0 gives the column sums and axis=1 gives the row sums.

    • “T” − It performs transpose of the specified matrix.

    import numpy
    # Two matrices are initialized by value
    x = numpy.array([[1, 2], [4, 5]])
    y = numpy.array([[7, 8], [9, 10]])
    #  add()is used to add matrices
    print ("Addition of two matrices: ")
    print (numpy.add(x,y))
    # subtract()is used to subtract matrices
    print ("Subtraction of two matrices : ")
    print (numpy.subtract(x,y))
    # divide()is used to divide matrices
    print ("Matrix Division : ")
    print (numpy.divide(x,y))
    print ("Multiplication of two matrices: ")
    print (numpy.multiply(x,y))
    print ("The product of two matrices : ")
    print (numpy.dot(x,y))
    print ("square root is : ")
    print (numpy.sqrt(x))
    print ("The summation of elements : ")
    print (numpy.sum(y))
    print ("The column wise summation  : ")
    print (numpy.sum(y,axis=0))
    print ("The row wise summation: ")
    print (numpy.sum(y,axis=1))
    # using "T" to transpose the matrix
    print ("Matrix transposition : ")
    print (x.T)
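
    As a side note, for 2-D arrays the @ operator gives the same result as dot(), while * is the element-wise product. A quick check with the same two matrices:

```python
import numpy as np

x = np.array([[1, 2], [4, 5]])
y = np.array([[7, 8], [9, 10]])

matrix_product = x @ y      # same as np.dot(x, y) for 2-D arrays
elementwise = x * y         # same as np.multiply(x, y)

print(matrix_product.tolist())  # [[25, 28], [73, 82]]
print(elementwise.tolist())     # [[7, 16], [36, 50]]
```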
    

    lambda functions

    x = lambda a: a+1
    x = lambda a, b : a * b
    
    # Apply lambda function in dataframe
    df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float')
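
    Lambdas are most useful as short throwaway arguments to other functions. The sketch below uses only the standard library; the names and values are illustrative.

```python
add_one = lambda a: a + 1
multiply = lambda a, b: a * b

print(add_one(2))      # 3
print(multiply(3, 4))  # 12

# As a sort key: order words by their length
words = ["banana", "fig", "apple"]
by_length = sorted(words, key=lambda w: len(w))
print(by_length)       # ['fig', 'apple', 'banana']

# The same percent-stripping idea as above, on a plain list
percents = ["12%", "3.5%"]
as_floats = list(map(lambda s: float(s.replace("%", "")), percents))
print(as_floats)       # [12.0, 3.5]
```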
    

    reshape numpy array

    NumPy lets us pass -1 for one dimension of the new shape, e.g. (2, -1) or (-1, 3), but not (-1, -1). The -1 marks an unknown dimension that NumPy works out from the length of the array and the remaining dimensions. The new shape must still hold exactly the same number of elements as the original array, otherwise reshape raises an error.
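
    A minimal sketch of the -1 placeholder on a 12-element array, including a shape that cannot work:

```python
import numpy as np

a = np.arange(12)                # 12 elements: 0..11

two_rows = a.reshape(2, -1)      # NumPy infers 6 columns
three_cols = a.reshape(-1, 3)    # NumPy infers 4 rows

print(two_rows.shape)    # (2, 6)
print(three_cols.shape)  # (4, 3)

try:
    a.reshape(5, -1)             # 12 is not divisible by 5
    bad_shape_failed = False
except ValueError:
    bad_shape_failed = True
```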

    String formats

    The format() method formats the specified value(s) and inserts them into the string's placeholders.

    The placeholder is defined using curly brackets: {}. A placeholder can be named ({fname}), numbered ({0}) or empty ({}).

    The format() method returns the formatted string.

    txt1 = "My name is {fname}, I'm {age}".format(fname = "John", age = 36)
    txt2 = "My name is {0}, I'm {1}".format("John",36)
    txt3 = "My name is {}, I'm {}".format("John",36)
    

    Inside the placeholders you can add a formatting type to format the result

    • :< Left aligns the result (within the available space)
    • :> Right aligns the result (within the available space)
    • :^ Center aligns the result (within the available space)
    • := Places the sign to the left most position
    • :+ Use a plus sign to indicate if the result is positive or negative
    • :- Use a minus sign for negative values only
    • :(space) Inserts an extra space before positive numbers (and a minus sign before negative numbers)
    • :, Use a comma as a thousand separator
    • :_ Use an underscore as a thousand separator
    • :b Binary format
    • :c Converts the value into the corresponding unicode character
    • :d Decimal format
    • :e Scientific format, with a lower case e
    • :E Scientific format, with an upper case E
    • :f Fixed-point number format; :.2f keeps 2 digits after the decimal point
    • :F Fixed-point number format, in uppercase (shows inf and nan as INF and NAN)
    • :g General format
    • :G General format (using an upper case E for scientific notation)
    • :o Octal format
    • :x Hex format, lower case
    • :X Hex format, upper case
    • :n Number format
    • :% Percentage format
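
    A few of the format types listed above, applied to invented values. The trailing "|" only makes the padding visible.

```python
left = "{:<6}|".format("ab")        # 'ab    |'
right = "{:>6}|".format("ab")       # '    ab|'
center = "{:^6}|".format("ab")      # '  ab  |'

thousands = "{:,}".format(1234567)      # '1,234,567'
underscores = "{:_}".format(1234567)    # '1_234_567'
two_digits = "{:.2f}".format(3.14159)   # '3.14'
binary = "{:b}".format(5)               # '101'
hex_upper = "{:X}".format(255)          # 'FF'
percent = "{:.1%}".format(0.25)         # '25.0%'
signed = "{:+d}".format(7)              # '+7'

print(left, right, center)
print(thousands, underscores, two_digits, binary, hex_upper, percent, signed)
```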

    Open a file

    The available modes are:

    Character  Meaning
    'r' open for reading (default)
    'w' open for writing, truncating the file first
    'x' open for exclusive creation, failing if the file already exists
    'a' open for writing, appending to the end of the file if it exists
    'b' binary mode
    't' text mode (default)
    '+' open for updating (reading and writing)

    The default mode is 'r' (open for reading text, synonym of 'rt'). Modes 'w+' and 'w+b' open and truncate the file (its contents are cleared first). Modes 'r+' and 'r+b' open the file with no truncation.

    Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
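
    The modes can be tried out safely in a temporary directory. This is a sketch; the file name demo.txt is made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")            # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:    # 'w': create or truncate
        f.write("first\n")
    with open(path, "a", encoding="utf-8") as f:    # 'a': append to the end
        f.write("second\n")

    with open(path, "r", encoding="utf-8") as f:    # text mode returns str
        text = f.read()
    with open(path, "rb") as f:                     # binary mode returns bytes
        raw = f.read()

    try:
        open(path, "x")                             # 'x': fails, the file exists
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

print(repr(text))
print(type(raw).__name__)
```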

    flatten

    numpy.ndarray.flatten() function

    The flatten() function is used to get a copy of a given array collapsed into one dimension.

    ‘C’ means to flatten in row-major (C-style) order. ‘F’ means to flatten in column-major (Fortran- style) order. ‘A’ means to flatten in column-major order if a is Fortran contiguous in memory, row-major order otherwise. ‘K’ means to flatten a in the order the elements occur in memory. The default is ‘C’.

    ndarray.flatten(order='C')
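
    A small sketch comparing row-major and column-major order, and checking that flatten() really returns a copy:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

row_major = a.flatten()              # default order='C': row by row
col_major = a.flatten(order="F")     # column by column

print(row_major.tolist())  # [1, 2, 3, 4]
print(col_major.tolist())  # [1, 3, 2, 4]

row_major[0] = 99                    # changing the copy ...
print(a[0, 0])                       # 1  (... leaves the original untouched)
```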
    

    underscore in python

    Underscore _ is considered as "I don't Care" or "Throwaway" variable in Python

    The underscore _ is also used for ignoring the specific values. If you don’t need the specific values or the values are not used, just assign the values to underscore.

    x, _, y = (1, 2, 3)
    
    >>> x
    1
    
    >>> y 
    3
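
    Two other common throwaway uses, sketched with invented values: ignoring several values at once with *_, and ignoring the loop counter.

```python
# Keep the first and last items, discard everything in between
first, *_, last = [1, 2, 3, 4, 5]
print(first, last)  # 1 5

# The loop counter itself is not needed, only the repetition
greetings = []
for _ in range(3):
    greetings.append("hi")
print(greetings)    # ['hi', 'hi', 'hi']
```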
    

    .copy()

    df_copy = df_all
    

    Here df_copy and df_all are linked: they are just two names for the same object, much like Set in VBA

    df_copy = df_all.copy()
    

    This creates a new object; the two are no longer linked
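
    The difference between a second name and a real copy shows up as soon as one of them is modified. The frame below is a made-up stand-in for df_all.

```python
import pandas as pd

df_all = pd.DataFrame({"a": [1, 2]})   # hypothetical data

alias = df_all           # second name for the same object
clone = df_all.copy()    # independent object (a deep copy by default)

alias.loc[0, "a"] = 99   # modify through the alias

print(alias is df_all)        # True
print(df_all.loc[0, "a"])     # 99: the original changed as well
print(clone.loc[0, "a"])      # 1: the copy did not
```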

    Flatten a list

    Given a list of lists l

    flat_list = [item for sublist in l for item in sublist]
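
    The same flattening, written once as the comprehension above and once with itertools.chain.from_iterable from the standard library. Both remove exactly one level of nesting; the sample list is invented.

```python
from itertools import chain

l = [[1, 2], [3], [4, 5]]

flat_list = [item for sublist in l for item in sublist]
flat_chain = list(chain.from_iterable(l))

print(flat_list)   # [1, 2, 3, 4, 5]
print(flat_chain)  # [1, 2, 3, 4, 5]
```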
    

    Membership test (in) in pandas

    data.isin([])
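
    isin returns a boolean mask, which can then be used to filter rows or, negated with ~, to exclude them. A sketch on an invented frame:

```python
import pandas as pd

data = pd.DataFrame({"city": ["Pune", "Delhi", "Goa"]})   # hypothetical data

mask = data["city"].isin(["Pune", "Goa"])    # True where the value is in the list

kept = data.loc[mask, "city"].tolist()       # rows whose city is in the list
excluded = data.loc[~mask, "city"].tolist()  # rows whose city is not

print(mask.tolist())  # [True, False, True]
print(kept)           # ['Pune', 'Goa']
print(excluded)       # ['Delhi']
```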

