The best way to learn as many basic concepts of statistics and probability as possible is through examples. Here we use a dataset on the use of public bikes in an American city. Analytics is a well-established practice in the United States, with examples across the economy, education, and sports, and decisions there routinely lean on surveys and research results.
The system has 500 bikes located at 50 stations spread across the city. Each station has a locking system and an accompanying kiosk where users can pay an annual membership or a one-day or three-day usage fee.
The dataset contains the following columns:
  • trip_id: unique identifier assigned to the trip
  • starttime: day and time when the trip started
  • stoptime: day and time when the trip ended
  • bikeid: unique identifier assigned to the bike
  • tripduration: duration of the trip in seconds
  • from_station_name: departure station
  • to_station_name: arrival station
  • from_station_id: unique identifier of the departure station
  • to_station_id: unique identifier of the arrival station
  • usertype: either a short-term pass holder or a member
  • gender: gender of the rider
  • birthyear: year of birth
In [1]:
# import the packages needed for data analysis
%matplotlib inline
import random
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import numpy as np
import scipy
from scipy import stats
import seaborn 

Exploratory data analysis (EDA)

In [2]:
# start the analysis by loading the data into memory
data = pd.read_csv('trip.csv')
In [3]:
# size of the dataset we are working with
print(len(data))
286858
In [4]:
# preview the first rows of the imported dataset
# we keep the original column names so that later conclusions stay traceable to the source
data.head()
Out[4]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 431 10/13/2014 10:31 10/13/2014 10:48 SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing… CBD-06 PS-04 Member Male 1960.0
1 432 10/13/2014 10:32 10/13/2014 10:48 SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing… CBD-06 PS-04 Member Male 1970.0
2 433 10/13/2014 10:33 10/13/2014 10:48 SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing… CBD-06 PS-04 Member Female 1988.0
3 434 10/13/2014 10:34 10/13/2014 10:48 SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing… CBD-06 PS-04 Member Female 1977.0
4 435 10/13/2014 10:34 10/13/2014 10:49 SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing… CBD-06 PS-04 Member Male 1971.0
In [5]:
# preview the last five rows of the imported dataset
data.tail() 
Out[5]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
286853 255241 8/31/2016 23:34 8/31/2016 23:45 SEA00201 679.532 Harvard Ave & E Pine St 2nd Ave & Spring St CH-09 CBD-06 Short-Term Pass Holder NaN NaN
286854 255242 8/31/2016 23:48 9/1/2016 0:20 SEA00247 1965.418 Cal Anderson Park / 11th Ave & Pine St 6th Ave S & S King St CH-08 ID-04 Short-Term Pass Holder NaN NaN
286855 255243 8/31/2016 23:47 9/1/2016 0:20 SEA00300 1951.173 Cal Anderson Park / 11th Ave & Pine St 6th Ave S & S King St CH-08 ID-04 Short-Term Pass Holder NaN NaN
286856 255244 8/31/2016 23:49 9/1/2016 0:20 SEA00047 1883.299 Cal Anderson Park / 11th Ave & Pine St 6th Ave S & S King St CH-08 ID-04 Short-Term Pass Holder NaN NaN
286857 255245 8/31/2016 23:49 9/1/2016 0:20 SEA00442 1896.031 Cal Anderson Park / 11th Ave & Pine St 6th Ave S & S King St CH-08 ID-04 Short-Term Pass Holder NaN NaN
In [6]:
# it is important to know which data types we are working with
# trip_id is an integer (int64); tripduration and birthyear are floats (float64)
# the remaining columns are objects, which usually means strings (text)
data.dtypes
Out[6]:
trip_id                int64
starttime             object
stoptime              object
bikeid                object
tripduration         float64
from_station_name     object
to_station_name       object
from_station_id       object
to_station_id         object
usertype              object
gender                object
birthyear            float64
dtype: object
By measurement scale, data can be divided into two groups:
  • qualitative variables (nominal and ordinal)
  • quantitative variables (interval and ratio)
Nominal: the data can be classified into categories, which are listed alphabetically or by frequency. Arithmetic operations are not allowed on this kind of data. The categories are expressed in words. They are divided into:
  • attributive (gender, hair colour, payment method, occupation, industry, political party membership)
  • geographic (place of birth, tourists' country of origin, place of residence, country of origin of imported goods)
Ordinal: the data can be ranked by the intensity of the measured property. Arithmetic operations are not allowed. The comparison operators =, < and > apply. The categories are expressed in words (codes).
  • examples (product quality ratings, grades, degree of customer satisfaction, level of education, level of economic development)
Interval: equal differences between numbers indicate equal differences in the measured property, and zero is set by convention. Addition and subtraction are allowed. Values are ordered by magnitude.
  • examples (temperature in degrees Celsius, temperature in degrees Fahrenheit, shoe size)
Ratio: equal differences between numbers indicate equal differences in the measured property, and zero means the property is absent. All arithmetic operations are allowed. Values are ordered by magnitude.
  • discrete (exam score, number of household members, number of students in a class)
  • continuous (income, salary, price, height, weight, distance)
In our example the variables fall into the following groups:
  • trip_id, bikeid, from_station_id, to_station_id: identifiers (nominal labels, even where they are stored as numbers)
  • tripduration: continuous
  • birthyear: discrete numeric (stored as float64 because some values are missing)
  • starttime, stoptime: DateTime (loaded as strings)
  • from_station_name, to_station_name: String
  • usertype, gender: nominal
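The classification above can be mirrored in pandas dtypes. The following is a minimal sketch on a tiny hypothetical sample shaped like trip.csv (the values are illustrative, not taken from the real file): nominal columns become unordered categories and the identifier is converted to text so that it is not treated as a quantity.

```python
import pandas as pd

# Tiny hypothetical sample shaped like trip.csv (illustrative values only)
sample = pd.DataFrame({
    "trip_id": [431, 432, 433],
    "tripduration": [985.935, 926.375, 883.831],
    "usertype": ["Member", "Member", "Short-Term Pass Holder"],
    "gender": ["Male", "Female", None],
    "birthyear": [1960.0, 1988.0, None],
})

# Nominal variables: store them as unordered categories
for col in ["usertype", "gender"]:
    sample[col] = sample[col].astype("category")

# Identifiers are labels, not quantities: keep them out of arithmetic
sample["trip_id"] = sample["trip_id"].astype(str)

# Only the genuinely numeric columns are left for numeric summaries
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

With the dtypes set this way, a call such as `sample.describe()` summarises only `tripduration` and `birthyear`, which are the columns where a mean or standard deviation is meaningful.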

Univariate analysis

In [7]:
# univariate analysis looks at a single variable of the dataset
# here we want to find out which period the data covers

# a DataFrame is the pandas data structure for a table

# first we sort the dataset by the "starttime" column
# note: starttime is still a string here, so the order is lexicographic ("1/1/2015" sorts before "10/13/2014"), not chronological
data = data.sort_values(by='starttime')
In [8]:
data.head()
Out[8]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
71032 25091 1/1/2015 0:24 1/1/2015 0:48 SEA00325 1403.479 Lake Union Park / Valley St & Boren Ave N 12th Ave & E Mercer St SLU-17 CH-15 Short-Term Pass Holder NaN NaN
20239 25091 1/1/2015 0:24 1/1/2015 0:48 SEA00325 1403.479 Lake Union Park / Valley St & Boren Ave N 12th Ave & E Mercer St SLU-17 CH-15 Short-Term Pass Holder NaN NaN
20240 25092 1/1/2015 0:37 1/1/2015 0:44 SEA00267 459.469 Harvard Ave & E Pine St Cal Anderson Park / 11th Ave & Pine St CH-09 CH-08 Member Male 1991.0
71033 25092 1/1/2015 0:37 1/1/2015 0:44 SEA00267 459.469 Harvard Ave & E Pine St Cal Anderson Park / 11th Ave & Pine St CH-09 CH-08 Member Male 1991.0
71034 25093 1/1/2015 0:44 1/1/2015 0:48 SEA00124 255.004 Harvard Ave & E Pine St REI / Yale Ave N & John St CH-09 SLU-01 Member Male 1987.0
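The output above starts at 1/1/2015 although the first trip in the file dates from 10/13/2014, because the strings are compared character by character. A minimal sketch of the difference, using three start times that appear in this notebook and a hypothetical helper column `starttime_parsed`:

```python
import pandas as pd

dates = pd.DataFrame(
    {"starttime": ["9/9/2015 9:55", "10/13/2014 10:31", "1/1/2015 0:24"]}
)

# Sorting the raw strings compares them character by character
lexicographic = dates.sort_values(by="starttime")["starttime"].tolist()

# Parsing first gives a true chronological order
dates["starttime_parsed"] = pd.to_datetime(
    dates["starttime"], format="%m/%d/%Y %H:%M"
)
chronological = dates.sort_values(by="starttime_parsed")["starttime"].tolist()

print(lexicographic)
print(chronological)
```

The same idea would apply to the full `data` frame by parsing `data['starttime']` before sorting.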
In [9]:
# sorting changed the order of the rows
# reset_index() renumbers the rows from 0 and keeps the old labels in a new column named "index"
# the result is not assigned, so data itself keeps its original index labels
data.reset_index()
Out[9]:
index trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 71032 25091 1/1/2015 0:24 1/1/2015 0:48 SEA00325 1403.479 Lake Union Park / Valley St & Boren Ave N 12th Ave & E Mercer St SLU-17 CH-15 Short-Term Pass Holder NaN NaN
1 20239 25091 1/1/2015 0:24 1/1/2015 0:48 SEA00325 1403.479 Lake Union Park / Valley St & Boren Ave N 12th Ave & E Mercer St SLU-17 CH-15 Short-Term Pass Holder NaN NaN
2 20240 25092 1/1/2015 0:37 1/1/2015 0:44 SEA00267 459.469 Harvard Ave & E Pine St Cal Anderson Park / 11th Ave & Pine St CH-09 CH-08 Member Male 1991.0
3 71033 25092 1/1/2015 0:37 1/1/2015 0:44 SEA00267 459.469 Harvard Ave & E Pine St Cal Anderson Park / 11th Ave & Pine St CH-09 CH-08 Member Male 1991.0
4 71034 25093 1/1/2015 0:44 1/1/2015 0:48 SEA00124 255.004 Harvard Ave & E Pine St REI / Yale Ave N & John St CH-09 SLU-01 Member Male 1987.0
5 20241 25093 1/1/2015 0:44 1/1/2015 0:48 SEA00124 255.004 Harvard Ave & E Pine St REI / Yale Ave N & John St CH-09 SLU-01 Member Male 1987.0
6 71052 25131 1/1/2015 10:14 1/1/2015 10:33 SEA00204 1145.254 Summit Ave E & E Republican St Occidental Park / Occidental Ave S & S Washing… CH-03 PS-04 Short-Term Pass Holder NaN NaN
7 20259 25131 1/1/2015 10:14 1/1/2015 10:33 SEA00204 1145.254 Summit Ave E & E Republican St Occidental Park / Occidental Ave S & S Washing… CH-03 PS-04 Short-Term Pass Holder NaN NaN
8 71053 25132 1/1/2015 10:55 1/1/2015 11:03 SEA00391 470.801 E Pine St & 16th Ave 7th Ave & Union St CH-07 CBD-03 Member Male 1980.0
9 20260 25132 1/1/2015 10:55 1/1/2015 11:03 SEA00391 470.801 E Pine St & 16th Ave 7th Ave & Union St CH-07 CBD-03 Member Male 1980.0
10 71054 25133 1/1/2015 10:59 1/1/2015 11:01 SEA00058 136.101 E Pine St & 16th Ave Cal Anderson Park / 11th Ave & Pine St CH-07 CH-08 Member Female 1986.0
11 20261 25133 1/1/2015 10:59 1/1/2015 11:01 SEA00058 136.101 E Pine St & 16th Ave Cal Anderson Park / 11th Ave & Pine St CH-07 CH-08 Member Female 1986.0
12 71055 25134 1/1/2015 11:09 1/1/2015 11:31 SEA00434 1294.421 Lake Union Park / Valley St & Boren Ave N Eastlake Ave E & E Allison St SLU-17 EL-05 Short-Term Pass Holder NaN NaN
13 20262 25134 1/1/2015 11:09 1/1/2015 11:31 SEA00434 1294.421 Lake Union Park / Valley St & Boren Ave N Eastlake Ave E & E Allison St SLU-17 EL-05 Short-Term Pass Holder NaN NaN
14 71056 25135 1/1/2015 11:31 1/1/2015 11:55 SEA00079 1461.638 12th Ave & E Denny Way Key Arena / 1st Ave N & Harrison St CH-06 SLU-19 Member Male 1987.0
15 20263 25135 1/1/2015 11:31 1/1/2015 11:55 SEA00079 1461.638 12th Ave & E Denny Way Key Arena / 1st Ave N & Harrison St CH-06 SLU-19 Member Male 1987.0
16 71057 25136 1/1/2015 11:34 1/1/2015 11:45 SEA00107 666.286 Eastlake Ave E & E Allison St 15th Ave NE & NE 40th St EL-05 UW-04 Short-Term Pass Holder NaN NaN
17 20264 25136 1/1/2015 11:34 1/1/2015 11:45 SEA00107 666.286 Eastlake Ave E & E Allison St 15th Ave NE & NE 40th St EL-05 UW-04 Short-Term Pass Holder NaN NaN
18 20265 25137 1/1/2015 11:45 1/1/2015 11:54 SEA00401 524.885 Harvard Ave & E Pine St Summit Ave E & E Republican St CH-09 CH-03 Member Female 1987.0
19 71058 25137 1/1/2015 11:45 1/1/2015 11:54 SEA00401 524.885 Harvard Ave & E Pine St Summit Ave E & E Republican St CH-09 CH-03 Member Female 1987.0
20 20266 25138 1/1/2015 11:47 1/1/2015 16:20 SEA00031 16390.548 NE 42nd St & University Way NE NE 42nd St & University Way NE UD-02 UD-02 Short-Term Pass Holder NaN NaN
21 71059 25138 1/1/2015 11:47 1/1/2015 16:20 SEA00031 16390.548 NE 42nd St & University Way NE NE 42nd St & University Way NE UD-02 UD-02 Short-Term Pass Holder NaN NaN
22 20267 25139 1/1/2015 11:48 1/1/2015 16:20 SEA00389 16321.122 NE 42nd St & University Way NE NE 42nd St & University Way NE UD-02 UD-02 Short-Term Pass Holder NaN NaN
23 71060 25139 1/1/2015 11:48 1/1/2015 16:20 SEA00389 16321.122 NE 42nd St & University Way NE NE 42nd St & University Way NE UD-02 UD-02 Short-Term Pass Holder NaN NaN
24 71061 25140 1/1/2015 11:56 1/1/2015 12:11 SEA00147 902.576 Lake Union Park / Valley St & Boren Ave N E Blaine St & Fairview Ave E SLU-17 EL-03 Short-Term Pass Holder NaN NaN
25 20268 25140 1/1/2015 11:56 1/1/2015 12:11 SEA00147 902.576 Lake Union Park / Valley St & Boren Ave N E Blaine St & Fairview Ave E SLU-17 EL-03 Short-Term Pass Holder NaN NaN
26 20269 25142 1/1/2015 12:00 1/1/2015 12:12 SEA00210 699.075 Lake Union Park / Valley St & Boren Ave N E Blaine St & Fairview Ave E SLU-17 EL-03 Short-Term Pass Holder NaN NaN
27 71062 25142 1/1/2015 12:00 1/1/2015 12:12 SEA00210 699.075 Lake Union Park / Valley St & Boren Ave N E Blaine St & Fairview Ave E SLU-17 EL-03 Short-Term Pass Holder NaN NaN
28 20270 25143 1/1/2015 12:13 1/1/2015 12:22 SEA00347 514.110 E Harrison St & Broadway Ave E Seattle University / E Columbia St & 12th Ave CH-02 FH-04 Member Female 1986.0
29 71063 25143 1/1/2015 12:13 1/1/2015 12:22 SEA00347 514.110 E Harrison St & Broadway Ave E Seattle University / E Columbia St & 12th Ave CH-02 FH-04 Member Female 1986.0
286828 179356 141641 9/9/2015 9:20 9/9/2015 9:36 SEA00052 980.237 Pier 69 / Alaskan Way & Clay St Pier 69 / Alaskan Way & Clay St WF-01 WF-01 Short-Term Pass Holder NaN NaN
286829 179357 141642 9/9/2015 9:20 9/9/2015 9:29 SEA00341 579.111 3rd Ave & Broad St 2nd Ave & Blanchard St BT-01 BT-05 Short-Term Pass Holder NaN NaN
286830 179358 141643 9/9/2015 9:25 9/9/2015 9:30 SEA00328 310.180 E Pine St & 16th Ave Pine St & 9th Ave CH-07 SLU-16 Member Male 1986.0
286831 179359 141644 9/9/2015 9:27 9/9/2015 9:33 SEA00112 402.045 12th Ave & E Mercer St Pine St & 9th Ave CH-15 SLU-16 Member Female 1992.0
286832 179360 141645 9/9/2015 9:27 9/9/2015 9:35 SEA00365 482.761 2nd Ave & Vine St 9th Ave N & Mercer St BT-03 DPD-01 Member Male 1985.0
286833 179361 141646 9/9/2015 9:30 9/9/2015 10:07 SEA00103 2210.221 Key Arena / 1st Ave N & Harrison St Key Arena / 1st Ave N & Harrison St SLU-19 SLU-19 Short-Term Pass Holder NaN NaN
286834 179362 141648 9/9/2015 9:30 9/9/2015 9:37 SEA00407 433.452 Key Arena / 1st Ave N & Harrison St Westlake Ave & 6th Ave SLU-19 SLU-15 Member Male 1988.0
286835 179363 141649 9/9/2015 9:30 9/9/2015 9:44 SEA00220 850.295 Dexter Ave & Denny Way E Blaine St & Fairview Ave E SLU-18 EL-03 Member Male 1980.0
286836 179364 141652 9/9/2015 9:33 9/9/2015 9:43 SEA00213 609.179 Seattle Aquarium / Alaskan Way S & Elliott Bay… 3rd Ave & Broad St WF-04 BT-01 Short-Term Pass Holder NaN NaN
286837 179365 141653 9/9/2015 9:35 9/9/2015 9:44 SEA00473 511.330 E Pine St & 16th Ave Westlake Ave & 6th Ave CH-07 SLU-15 Member Other 1992.0
286838 179366 141654 9/9/2015 9:36 9/9/2015 9:46 SEA00352 617.689 Westlake Ave & 6th Ave Pier 69 / Alaskan Way & Clay St SLU-15 WF-01 Member Male 1981.0
286839 179367 141655 9/9/2015 9:36 9/9/2015 9:43 SEA00411 425.871 Pine St & 9th Ave 9th Ave N & Mercer St SLU-16 DPD-01 Member Male 1979.0
286840 179371 141659 9/9/2015 9:37 9/9/2015 9:52 SEA00235 884.904 2nd Ave & Vine St Union St & 4th Ave BT-03 CBD-04 Short-Term Pass Holder NaN NaN
286841 179369 141657 9/9/2015 9:37 9/9/2015 9:52 SEA00495 914.284 2nd Ave & Vine St Union St & 4th Ave BT-03 CBD-04 Short-Term Pass Holder NaN NaN
286842 179368 141656 9/9/2015 9:37 9/9/2015 9:43 SEA00449 389.698 Cal Anderson Park / 11th Ave & Pine St REI / Yale Ave N & John St CH-08 SLU-01 Member Male 1993.0
286843 179370 141658 9/9/2015 9:37 9/9/2015 9:46 SEA00185 527.781 3rd Ave & Broad St 2nd Ave & Pine St BT-01 CBD-13 Member Male 1985.0
286844 179372 141660 9/9/2015 9:39 9/9/2015 9:46 SEA00124 400.267 Summit Ave E & E Republican St Republican St & Westlake Ave N CH-03 SLU-04 Member Male 1992.0
286845 179373 141661 9/9/2015 9:40 9/9/2015 9:55 SEA00045 938.904 Eastlake Ave E & E Allison St Lake Union Park / Valley St & Boren Ave N EL-05 SLU-17 Short-Term Pass Holder NaN NaN
286846 179374 141662 9/9/2015 9:40 9/9/2015 9:46 SEA00197 335.593 Seattle Aquarium / Alaskan Way S & Elliott Bay… Pier 69 / Alaskan Way & Clay St WF-04 WF-01 Member Female 1950.0
286847 179375 141663 9/9/2015 9:40 9/9/2015 9:50 SEA00427 575.674 Cal Anderson Park / 11th Ave & Pine St Republican St & Westlake Ave N CH-08 SLU-04 Member Male 1991.0
286848 179376 141664 9/9/2015 9:41 9/9/2015 9:46 SEA00227 307.400 12th Ave & E Yesler Way City Hall / 4th Ave & James St CD-01 CBD-07 Member Male 1986.0
286849 179377 141665 9/9/2015 9:41 9/9/2015 9:51 SEA00404 567.853 Key Arena / 1st Ave N & Harrison St PATH / 9th Ave & Westlake Ave SLU-19 SLU-07 Member Male 1989.0
286850 179378 141666 9/9/2015 9:43 9/9/2015 9:55 SEA00293 732.077 E Harrison St & Broadway Ave E Occidental Park / Occidental Ave S & S Washing… CH-02 PS-04 Short-Term Pass Holder NaN NaN
286851 179379 141667 9/9/2015 9:45 9/9/2015 10:16 SEA00172 1827.683 Republican St & Westlake Ave N 2nd Ave & Spring St SLU-04 CBD-06 Short-Term Pass Holder NaN NaN
286852 179380 141668 9/9/2015 9:45 9/9/2015 9:55 SEA00080 603.804 Summit Ave & E Denny Way 1st Ave & Marion St CH-01 CBD-05 Member Male 1980.0
286853 179381 141669 9/9/2015 9:46 9/9/2015 9:54 SEA00460 473.064 E Pine St & 16th Ave Terry Ave & Stewart St CH-07 SLU-20 Member Male 1978.0
286854 179382 141670 9/9/2015 9:49 9/9/2015 9:54 SEA00328 321.262 Pine St & 9th Ave Republican St & Westlake Ave N SLU-16 SLU-04 Member Male 1983.0
286855 179383 141671 9/9/2015 9:49 9/9/2015 9:55 SEA00473 359.629 Westlake Ave & 6th Ave 9th Ave N & Mercer St SLU-15 DPD-01 Member Male 1970.0
286856 179384 141672 9/9/2015 9:55 9/9/2015 9:59 SEA00266 252.431 6th Ave & Blanchard St Republican St & Westlake Ave N BT-04 SLU-04 Member Male 1988.0
286857 179385 141673 9/9/2015 9:55 9/9/2015 10:00 SEA00117 288.925 Pier 69 / Alaskan Way & Clay St Seattle Aquarium / Alaskan Way S & Elliott Bay… WF-01 WF-04 Member Female 1982.0
286858 rows × 13 columns
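Because `reset_index()` returns a new DataFrame, the call above only displays a renumbered copy. A small sketch on a hypothetical three-row frame shows the difference between keeping the old labels as a column and dropping them with `drop=True`:

```python
import pandas as pd

# Hypothetical frame with non-consecutive index labels
df = pd.DataFrame({"value": [30, 10, 20]}, index=[7, 3, 5])
df_sorted = df.sort_values(by="value")

# Old labels are kept in a new column named "index"
with_old_labels = df_sorted.reset_index()

# Old labels are discarded; the result must be assigned to take effect
renumbered = df_sorted.reset_index(drop=True)

print(with_old_labels.columns.tolist())
print(renumbered.index.tolist())
```

In this notebook the later cells rely on the original labels (`data.loc[0, ...]`), so assigning the renumbered frame back to `data` would change what those cells return.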
In [10]:
data.head()
Out[10]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
71032 25091 1/1/2015 0:24 1/1/2015 0:48 SEA00325 1403.479 Lake Union Park / Valley St & Boren Ave N 12th Ave & E Mercer St SLU-17 CH-15 Short-Term Pass Holder NaN NaN
20239 25091 1/1/2015 0:24 1/1/2015 0:48 SEA00325 1403.479 Lake Union Park / Valley St & Boren Ave N 12th Ave & E Mercer St SLU-17 CH-15 Short-Term Pass Holder NaN NaN
20240 25092 1/1/2015 0:37 1/1/2015 0:44 SEA00267 459.469 Harvard Ave & E Pine St Cal Anderson Park / 11th Ave & Pine St CH-09 CH-08 Member Male 1991.0
71033 25092 1/1/2015 0:37 1/1/2015 0:44 SEA00267 459.469 Harvard Ave & E Pine St Cal Anderson Park / 11th Ave & Pine St CH-09 CH-08 Member Male 1991.0
71034 25093 1/1/2015 0:44 1/1/2015 0:48 SEA00124 255.004 Harvard Ave & E Pine St REI / Yale Ave N & John St CH-09 SLU-01 Member Male 1987.0
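The preview above lists each trip_id twice (for example 25091 under index labels 71032 and 20239), which suggests the file contains repeated rows. A minimal sketch for counting and dropping exact duplicates, using a few values copied from that output:

```python
import pandas as pd

# Values copied from the preview above; 25091 appears twice
trips = pd.DataFrame({
    "trip_id": [25091, 25091, 25092, 25093],
    "tripduration": [1403.479, 1403.479, 459.469, 255.004],
})

# duplicated() marks every repeat of an earlier identical row
n_duplicates = int(trips.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = trips.drop_duplicates()

print(n_duplicates, len(deduplicated))
```

Whether the repeats should be removed before counting trips is a judgment about the data source; the sketch only shows how to measure them.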
In [11]:
data.loc[0,'starttime'] # loc selects by label: label 0 is the first row of the original file, which gives the first date in our data
Out[11]:
'10/13/2014 10:31'
In [12]:
data.loc[len(data)-1,'stoptime'] # the last label of the original file gives the last date for which we have data
Out[12]:
'9/1/2016 0:20'
In [13]:
# print both values together (label 0, not 1, is the first row)
print("Data range: %s - %s"%(data.loc[0,'starttime'],data.loc[len(data)-1,'stoptime']))
Data range: 10/13/2014 10:31 - 9/1/2016 0:20
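Reading the first and last labels works only while the file itself is in chronological order. A sketch that does not depend on row order parses the timestamps and takes the minimum and maximum; the three rows are a hypothetical sample built from times shown in this notebook:

```python
import pandas as pd

trips = pd.DataFrame({
    "starttime": ["1/1/2015 0:24", "10/13/2014 10:31", "8/31/2016 23:49"],
    "stoptime": ["1/1/2015 0:48", "10/13/2014 10:48", "9/1/2016 0:20"],
})

start = pd.to_datetime(trips["starttime"], format="%m/%d/%Y %H:%M")
stop = pd.to_datetime(trips["stoptime"], format="%m/%d/%Y %H:%M")

# Earliest start and latest stop, regardless of how the rows are ordered
first, last = start.min(), stop.max()
print(f"Data range: {first} - {last}")
```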
In [14]:
# next we plot the distribution of users by pass type
# there is a member group and a short-term pass holder group
# group the rows by user type and count the rows in each group
groupby_user = data.groupby('usertype').size()
# plot the bar chart
groupby_user.plot.bar(title = "Distribution by user type")
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x167ca588>
In [15]:
# a study found that 77 % of public bike users in the UK are male
# let's check how this looks in the observed American city
groupby_gender = data.groupby('gender').size()
groupby_gender.plot.bar(title = 'Distribution by gender')
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x13bc3e80>
The gender split does not differ from the one in the UK. If we want to define a target audience for marketing, one option is to plot the distribution by users' birth year.
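A bar chart shows the shape of the split, but a comparison with a percentage such as the 77 % figure needs shares rather than counts. A minimal sketch with `value_counts(normalize=True)` on a hypothetical gender column (the values are made up for illustration; missing values stand in for short-term pass holders):

```python
import pandas as pd

# Hypothetical column: 7 male, 2 female, 1 other, 5 missing
gender = pd.Series(
    ["Male"] * 7 + ["Female"] * 2 + ["Other"] + [None] * 5, name="gender"
)

# Missing values are excluded by default, so shares sum to 1 over known genders
shares = gender.value_counts(normalize=True)
print(shares)
```

Applied to `data['gender']`, the same call would give the share of trips by riders of each recorded gender.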
In [16]:
data=data.sort_values(by='birthyear')
groupby_birthyear = data.groupby('birthyear').size()
groupby_birthyear.plot.bar(title='Distribution of users by birth year', figsize = (15,4))
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x109cc828>
Most users were born between 1977 and 1992. Within that range, more than 10 % of users were born in 1987 (millennials, born from the early 1980s to the late 1990s). Next, we want to see what share of the users in the most heavily represented range holds a membership card.
In [17]:
data_mil = data[(data['birthyear'] >= 1977) & (data['birthyear']<=1994)]
groupby_mil = data_mil.groupby('usertype').size()
groupby_mil.plot.bar(title='Distribution by user type')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x10964da0>

Multivariate analysis

Multivariate analysis involves several variables at once in order to understand the behaviour of the observed subjects. It is an effective and realistic approach, given that entities are usually interrelated.

Plotting the distribution by birth year and gender

In [18]:
# in this step we compute the counts
# unstack spreads the values of one index level across several columns
# group the data by birth year and gender
# count the rows per year: how many users were born in a given year
# split the gender level into one column per category
# where there is no value, fill in zero
groupby_birthyear_gender = data.groupby(['birthyear','gender'])['birthyear'].count().unstack('gender').fillna(0)
In [19]:
groupby_birthyear_gender.head()
Out[19]:
gender Female Male Other
birthyear
1931.0 1.0 0.0 0.0
1936.0 0.0 8.0 0.0
1939.0 0.0 40.0 0.0
1942.0 0.0 4.0 0.0
1943.0 0.0 21.0 0.0
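The groupby, count, unstack, and fillna chain builds a contingency table. As a cross-check, `pd.crosstab` yields the same counts in one call. A small sketch on hypothetical rider rows:

```python
import pandas as pd

# Hypothetical rows: two riders born in 1987, three in 1990
riders = pd.DataFrame({
    "birthyear": [1987.0, 1987.0, 1990.0, 1990.0, 1990.0],
    "gender": ["Male", "Female", "Male", "Male", "Male"],
})

# Same chain as in the notebook cell
via_groupby = (
    riders.groupby(["birthyear", "gender"])["birthyear"]
    .count()
    .unstack("gender")
    .fillna(0)
)

# One-call equivalent; counts come back as integers instead of floats
via_crosstab = pd.crosstab(riders["birthyear"], riders["gender"])

print(via_groupby)
print(via_crosstab)
```

The crosstab version avoids the float counts (such as 8.0) that appear after `fillna(0)`.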
In [20]:
groupby_birthyear_gender[['Male','Female','Other']].plot.bar(title="Distribution by gender and birth year", stacked = True, figsize = (15,4))
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x13c559b0>
In [21]:
# how do things look when we consider birth year together with user type
groupby_birthyear_user = data.groupby(['birthyear','usertype'])['birthyear'].count().unstack('usertype').fillna(0)
In [22]:
groupby_birthyear_user.head()
Out[22]:
usertype Member
birthyear
1931.0 1.0
1936.0 8.0
1939.0 40.0
1942.0 4.0
1943.0 21.0
In [23]:
groupby_birthyear_user.tail()
Out[23]:
usertype Member
birthyear
1995.0 910.0
1996.0 355.0
1997.0 101.0
1998.0 36.0
1999.0 6.0
In [24]:
groupby_birthyear_user['Member'].plot.bar(title = 'Distribution by birth year and user type', stacked=True, figsize = (15,4))
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1821dfd0>
In the previous step only a Member column appeared; no rows were counted for Short-Term Pass Holder users. Let's check whether the birth year is really missing for all of them.
In [25]:
data[data['usertype'] == 'Short-Term Pass Holder']['birthyear'].isnull().values.all()
Out[25]:
True
The check returns True: no birth year is recorded for any trip made with this kind of pass. The conclusion is that birth-year information is not collected when short-term passes are used.
Do we have gender data for short-term pass users? The answer is no.
In [26]:
data[data['usertype'] == 'Short-Term Pass Holder']['gender'].isnull().values.all()
Out[26]:
True
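Both checks can be combined into one table that shows the share of missing values per user type and column. This is a sketch on a hypothetical four-row sample, not the notebook's own method:

```python
import pandas as pd

# Hypothetical sample: members have demographics, short-term holders do not
trips = pd.DataFrame({
    "usertype": ["Member", "Member",
                 "Short-Term Pass Holder", "Short-Term Pass Holder"],
    "gender": ["Male", "Female", None, None],
    "birthyear": [1987.0, 1990.0, None, None],
})

# isnull() gives True/False per cell; the group mean is the missing share
missing_share = (
    trips[["gender", "birthyear"]].isnull().groupby(trips["usertype"]).mean()
)
print(missing_share)
```

A value of 1.0 means the column is missing for every row of that user type, which matches a `True` result from the `.all()` checks.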
The next task is to examine how usage frequency relates to the date and hour. We have this data, but to show it in a chart we must convert it to datetime format. We will also split it into year, month, day, and hour.
In [27]:
# convert the starttime column to a list
List_ = list(data['starttime'])
# convert the strings to Python datetime objects
List_ = [datetime.datetime.strptime(x, "%m/%d/%Y %H:%M") for x in List_]
# wrap the values in pandas Series aligned to data's index and store them as new columns
data['starttime_mod'] = pd.Series(List_,index=data.index)
data['starttime_date'] = pd.Series([x.date() for x in List_],index=data.index)
data['starttime_year'] = pd.Series([x.year for x in List_],index=data.index)
data['starttime_month'] = pd.Series([x.month for x in List_],index=data.index)
data['starttime_day'] = pd.Series([x.day for x in List_],index=data.index)
data['starttime_hour'] = pd.Series([x.hour for x in List_],index=data.index)
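The list-based conversion above can also be written with vectorized pandas calls. A minimal sketch on two start times taken from this notebook, using `pd.to_datetime` and the `.dt` accessor to derive the same columns:

```python
import pandas as pd

trips = pd.DataFrame({"starttime": ["7/8/2016 16:22", "11/23/2014 16:46"]})

# Parse the whole column at once instead of looping over a list
parsed = pd.to_datetime(trips["starttime"], format="%m/%d/%Y %H:%M")

trips["starttime_mod"] = parsed
trips["starttime_date"] = parsed.dt.date
trips["starttime_year"] = parsed.dt.year
trips["starttime_month"] = parsed.dt.month
trips["starttime_day"] = parsed.dt.day
trips["starttime_hour"] = parsed.dt.hour

print(trips)
```

Because the result of `pd.to_datetime` keeps the index of the input column, no explicit `index=data.index` argument is needed.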
In [28]:
data.head()
Out[28]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear starttime_mod starttime_date starttime_year starttime_month starttime_day starttime_hour
263686 231080 7/8/2016 16:22 7/8/2016 16:53 SEA00423 1817.524 E Pine St & 16th Ave E Pine St & 16th Ave CH-07 CH-07 Member Female 1931.0 2016-07-08 16:22:00 2016-07-08 2016 7 8 16
13302 16581 11/23/2014 16:46 11/23/2014 16:48 SEA00355 133.610 REI / Yale Ave N & John St REI / Yale Ave N & John St SLU-01 SLU-01 Member Male 1936.0 2014-11-23 16:46:00 2014-11-23 2014 11 23 16
64095 16581 11/23/2014 16:46 11/23/2014 16:48 SEA00355 133.610 REI / Yale Ave N & John St REI / Yale Ave N & John St SLU-01 SLU-01 Member Male 1936.0 2014-11-23 16:46:00 2014-11-23 2014 11 23 16
168216 129895 8/16/2015 17:21 8/16/2015 17:46 SEA00494 1530.681 Terry Ave & Stewart St Terry Ave & Stewart St SLU-20 SLU-20 Member Male 1936.0 2015-08-16 17:21:00 2015-08-16 2015 8 16 17
184490 147070 9/20/2015 16:06 9/20/2015 16:15 SEA00067 547.429 Terry Ave & Stewart St 2nd Ave & Pine St SLU-20 CBD-13 Member Male 1936.0 2015-09-20 16:06:00 2015-09-20 2015 9 20 16
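Once a parsed start time is available, daily aggregates can also be computed with `resample`. The sketch below uses hypothetical durations, including one very long trip, to show how the daily mean and median can diverge:

```python
import pandas as pd

# Hypothetical trips; the 16000 s trip is an outlier on the first day
trips = pd.DataFrame({
    "starttime": ["1/1/2015 11:45", "1/1/2015 11:47", "1/1/2015 12:13",
                  "1/2/2015 9:00", "1/2/2015 9:30"],
    "tripduration": [500.0, 16000.0, 600.0, 600.0, 800.0],
})
trips["starttime_mod"] = pd.to_datetime(
    trips["starttime"], format="%m/%d/%Y %H:%M"
)

# One row per calendar day with the mean and median duration in seconds
daily = (
    trips.set_index("starttime_mod")["tripduration"]
    .resample("D")
    .agg(["mean", "median"])
)
print(daily)
```

A single multi-hour rental pulls the daily mean far above the median, which is worth keeping in mind when reading a chart of mean trip duration by date.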
In [29]:
data.groupby('starttime_date')['tripduration'].mean().plot.bar(title = 
'Distribution of mean trip duration by date', figsize = (15,4))
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x17bc8ba8>