Combining results in Spark is a fundamental process in the analysis and processing of large volumes of data. Spark, the popular distributed processing framework, offers several options for joining and combining the results of operations performed in your environment. In this article, we explore the techniques and methods Spark provides for combining results efficiently. From combining RDDs to using aggregation operations, you will see how to make the most of Spark's capabilities to get fast, accurate results in your Big Data projects.
Combining RDDs is one of the most basic and common ways to combine results in Spark. RDDs (Resilient Distributed Datasets) are Spark's fundamental data structure, and they allow distributed, parallel operations to be carried out efficiently. By combining two or more RDDs, operations such as union, intersection, or difference can be performed between datasets, which gives great flexibility for manipulating and combining the results of operations performed in Spark.
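As a rough illustration of what these set-style combinations compute, here is a plain-Python sketch that models union, intersection and difference on two small collections. It does not need a Spark installation; the helper names are hypothetical, and the RDD methods they imitate (`union`, `intersection`, `subtract`) are noted in the comments. Spark does not guarantee element order, so the intersection is sorted here only to make the result easy to read.

```python
def rdd_union(a, b):
    # Models rdd_a.union(rdd_b): plain concatenation, duplicates are kept.
    return list(a) + list(b)


def rdd_intersection(a, b):
    # Models rdd_a.intersection(rdd_b): distinct elements present in both.
    return sorted(set(a) & set(b))


def rdd_subtract(a, b):
    # Models rdd_a.subtract(rdd_b): elements of a that do not appear in b.
    other = set(b)
    return [x for x in a if x not in other]


first = [1, 2, 2, 3, 4]
second = [3, 4, 4, 5]

union_result = rdd_union(first, second)
intersection_result = rdd_intersection(first, second)
difference_result = rdd_subtract(first, second)
```

Note that the union keeps repeated values; in Spark, a `distinct()` call after `union` is the usual way to remove them.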
Another way to combine results in Spark is through aggregation operations. These operations allow multiple results to be combined into one, using aggregation functions such as sums, averages, maximums or minimums. With them, it is possible to obtain consolidated, summarized results from large amounts of data in a single step, which is especially useful when metrics or statistics need to be computed over a whole dataset.
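The single-step idea can be sketched in plain Python as a two-level fold: each partition is reduced locally, and the partial results are then merged. This mirrors the shape of the RDD `aggregate(zeroValue, seqOp, combOp)` call, but it is only a model; the data and the partition layout are made up for the example.

```python
from functools import reduce


def seq_op(acc, value):
    # Fold one value into a partition-local accumulator: (sum, count, min, max).
    total, count, lowest, highest = acc
    return (total + value, count + 1, min(lowest, value), max(highest, value))


def comb_op(left, right):
    # Merge the accumulators produced by two partitions.
    return (
        left[0] + right[0],
        left[1] + right[1],
        min(left[2], right[2]),
        max(left[3], right[3]),
    )


ZERO = (0, 0, float("inf"), float("-inf"))

# Hypothetical data, already split into three partitions.
partitions = [[4, 8], [15, 16], [23, 42]]

per_partition = [reduce(seq_op, part, ZERO) for part in partitions]
total, count, minimum, maximum = reduce(comb_op, per_partition, ZERO)
average = total / count
```

Carrying the sum and the count together is what makes the average computable in one pass, since averages of partitions cannot simply be averaged again.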
In addition to RDD combination and aggregation operations, Spark also offers other ways to combine results, such as accumulator variables and reduce operations. Accumulator variables let you gather results efficiently in a single place, which is especially useful when you want to collect information from many different tasks. Reduce operations, on the other hand, allow many results to be combined into a single one by applying a user-defined function. These techniques give more flexibility and control over how results are combined in Spark.
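A minimal plain-Python sketch of both ideas follows. The `Accumulator` class is a toy stand-in written for this example, not Spark's own class, and the record values are invented. The combine function passed to `reduce` is associative and commutative, which is what a distributed reduce needs in order to give the same answer regardless of how the data is partitioned.

```python
from functools import reduce


class Accumulator:
    """Toy stand-in for a Spark accumulator: tasks only add, the driver reads."""

    def __init__(self, initial=0):
        self.value = initial

    def add(self, amount):
        self.value += amount


def larger(a, b):
    # User-defined combine function (associative and commutative).
    return a if a >= b else b


records = ["12", "7", "n/a", "31", "", "5"]
bad_records = Accumulator()


def parse(record):
    # Count malformed records as a side effect, as a task would do.
    if record.isdigit():
        return [int(record)]
    bad_records.add(1)
    return []


valid = [value for record in records for value in parse(record)]
largest = reduce(larger, valid)
```

In real Spark jobs, accumulator updates made inside transformations can be applied more than once if a task is re-executed, so accumulators are best treated as diagnostic counters rather than as exact results.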
In short, combining results in Spark is an essential process for handling and analyzing large volumes of data efficiently. Spark offers several techniques and methods for combining results, such as combining RDDs, aggregation operations, accumulator variables, and reduce operations. By using these tools well, developers and analysts can obtain accurate, fast results in their Big Data projects. In the following sections, we look at each technique in more detail to better understand how results are combined in Spark.
1. Join algorithms available in Spark
Spark is a distributed computing framework that offers a wide range of join algorithms for combining the results of parallel operations. These algorithms are designed for efficiency and scalability in big data environments. Below are some of the most commonly used ones in Spark:
- Merge: This algorithm combines two sorted datasets into a single sorted set. It uses a divide-and-conquer approach to merge the data efficiently.
- Join: The join algorithm combines two datasets based on a common key. It uses techniques such as partitioning and redistributing data to optimize the join process. It is very useful for joining tables in SQL queries.
- groupByKey: This algorithm groups the values associated with each key into a collection. It is useful when you need to perform aggregation operations, such as a sum or an average, based on a given key.
These join algorithms are only a sample of the options available in Spark. Each has its own benefits and can be used in different scenarios depending on the requirements. It is important to understand these algorithms and use them well to get good performance and scalability in Spark projects.
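To make the first and last items in the list concrete, here is a small plain-Python sketch of merging two sorted sequences and of grouping values by key. It illustrates the ideas only and says nothing about how Spark implements them internally; the sample data is made up.

```python
import heapq
from collections import defaultdict

# Merge: two inputs that are already sorted become one sorted sequence.
left = [1, 4, 9, 12]
right = [2, 3, 10, 20]
merged = list(heapq.merge(left, right))

# groupByKey: collect every value that shares a key.
pairs = [("a", 1), ("b", 2), ("a", 3)]
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
grouped = dict(grouped)
```

`heapq.merge` never re-sorts its inputs; it only compares the current heads of each sequence, which is why the inputs must be sorted beforehand.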
2. Data combination methods in Spark
There are several data combination methods in Spark that allow different datasets to be brought together efficiently. One of the most common is the join, which links two or more datasets using a common key. This method is especially useful when you want to relate data based on a specific attribute, such as a unique identifier. Spark offers several join types, including inner join, left join, right join and full outer join, to fit different scenarios.
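The difference between the join types is easiest to see on a tiny example. The sketch below models them in plain Python on lists of (key, value) pairs, returning rows shaped like the results of the pair-RDD methods `join`, `leftOuterJoin`, `rightOuterJoin` and `fullOuterJoin`. The function name `join_pairs` and the sample data are hypothetical.

```python
from collections import defaultdict


def join_pairs(left, right, how="inner"):
    """Join two lists of (key, value) pairs; a missing side becomes None."""
    l_index, r_index = defaultdict(list), defaultdict(list)
    for key, value in left:
        l_index[key].append(value)
    for key, value in right:
        r_index[key].append(value)

    if how == "inner":
        keys = l_index.keys() & r_index.keys()
    elif how == "left":
        keys = l_index.keys()
    elif how == "right":
        keys = r_index.keys()
    elif how == "full":
        keys = l_index.keys() | r_index.keys()
    else:
        raise ValueError(f"unknown join type: {how}")

    rows = []
    for key in sorted(keys):
        # .get() avoids creating empty entries in the defaultdicts.
        for l_value in l_index.get(key, [None]):
            for r_value in r_index.get(key, [None]):
                rows.append((key, (l_value, r_value)))
    return rows


users = [(1, "Ana"), (2, "Ben"), (3, "Caro")]
orders = [(1, "book"), (1, "pen"), (4, "lamp")]

inner = join_pairs(users, orders, "inner")
left_outer = join_pairs(users, orders, "left")
right_outer = join_pairs(users, orders, "right")
full_outer = join_pairs(users, orders, "full")
```

User 1 has two orders, so it appears twice in every result; keys present on only one side survive only in the outer variants that preserve that side.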
Another way to combine data in Spark is aggregation. This method groups data and combines values based on a common key. It is especially useful when you want aggregated results, such as the sum, average, minimum or maximum of a particular attribute. Spark provides several aggregation functions, such as sum, count, avg, min and max, which make this straightforward.
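As a sketch of what a per-key aggregation produces, the following plain-Python example computes the same five statistics for each key. It models the result of a DataFrame `groupBy(...).agg(...)` call rather than calling Spark itself, and the sales figures are invented.

```python
from collections import defaultdict

# Hypothetical (region, amount) records.
sales = [
    ("north", 10.0),
    ("south", 4.0),
    ("north", 30.0),
    ("south", 6.0),
    ("north", 20.0),
]

# Group the amounts by region.
groups = defaultdict(list)
for region, amount in sales:
    groups[region].append(amount)

# Compute sum, count, avg, min and max for every region.
summary = {
    region: {
        "sum": sum(values),
        "count": len(values),
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
    for region, values in groups.items()
}
```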
In addition to the methods above, Spark also offers cross join operations, which combine two datasets without a common key. They produce every possible combination of the elements of both sets and can be useful for cases such as generating a Cartesian product or building a dataset for exhaustive testing. However, because the output grows with the product of the two input sizes, these operations can be expensive in both execution time and resources.
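A short plain-Python sketch shows why the cost grows so quickly. `itertools.product` plays the role of a cross join here (in Spark the corresponding calls are `cartesian` on RDDs and `crossJoin` on DataFrames); the input lists are made up.

```python
from itertools import product

sizes = ["S", "M", "L"]
colors = ["red", "blue"]

# Every size paired with every color: 3 * 2 combinations.
combinations = list(product(sizes, colors))

# The output size is always the product of the input sizes.
rows_for_two_modest_tables = 10_000 * 10_000
```

Two inputs of only ten thousand rows each already yield one hundred million output rows, which is why cross joins are usually reserved for small inputs.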
3. Factors to consider when combining results in Spark
Spark distributed processing
One of Spark's best-known advantages is its ability to process large volumes of data in a distributed way. This is possible thanks to its in-memory processing engine and its ability to split work into tasks and distribute them across a cluster of nodes. When combining results in Spark, it is important to keep this in mind: the work should be distributed evenly across the nodes so that the available resources are used well.
Data caching and persistence
The use of caching and data persistence is another important factor when combining results in Spark. By default, Spark recomputes a dataset each time an action needs it; when you call cache() or persist(), it keeps the result in memory or on disk, depending on the configured storage level. With appropriate caching or persistence, data stays readily available for later queries and computations, so the same results are not computed again. This can significantly improve performance when several results are combined in Spark.
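The effect of caching can be sketched with a toy lazily evaluated dataset in plain Python. The `LazyDataset` class below is invented for this example and only imitates the behavior described above: without caching every action triggers the computation again, and with caching it runs once.

```python
class LazyDataset:
    """Toy model of a lazily evaluated dataset with optional caching."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._use_cache = False
        self.computations = 0  # how many times the work was actually done

    def cache(self):
        # Mark the dataset for caching; nothing is computed yet.
        self._use_cache = True
        return self

    def collect(self):
        # An "action": returns the data, computing it if necessary.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def expensive():
    return [x * x for x in range(5)]


uncached = LazyDataset(expensive)
uncached.collect()
uncached.collect()

cached = LazyDataset(expensive).cache()
cached.collect()
cached.collect()
```

As in Spark, calling `cache()` does no work by itself; the data is materialized by the first action and reused by the following ones.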
Choosing the right algorithm
Choosing the right algorithm is also an important factor when combining results in Spark. Depending on the type of data and the result you want, some algorithms are more efficient than others. For example, if you want to perform clustering or classification on a dataset, you can pick a suitable algorithm for each, such as K-means or logistic regression, respectively. Choosing the right algorithm can reduce processing time and give more accurate results in Spark.
4. Strategies for combining data in Spark
Spark is a data processing framework that is widely used because of its ability to handle large volumes of data efficiently. One of its key features is the ability to combine data efficiently, which is essential in many use cases. There are several strategies to choose from, depending on the needs of the project.
One of the most common strategies for combining data in Spark is the join, which lets you combine two or more datasets based on a common column. Joins come in several types, including inner join, outer join, and left or right join. Each type has its own characteristics and is chosen according to the data you want to combine and the result you want to obtain.
Another effective strategy for combining data in Spark is repartitioning. Repartitioning is the process of redistributing data across the Spark cluster based on a key column or set of columns. This can be useful when you want a later join to combine the data more efficiently. Repartitioning can be done with the repartition function in Spark.
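The reason repartitioning by key helps a later join can be sketched in plain Python: if both datasets are partitioned with the same function, matching keys always land in the partition with the same index, so each pair of partitions can be joined on its own. The partitioner below is a hypothetical one based on `zlib.crc32`, chosen only because it is deterministic; it is not the hash function Spark uses.

```python
import zlib


def partition_of(key, num_partitions):
    # Deterministic hash partitioner (Python's built-in hash() is salted for str).
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def repartition_by_key(pairs, num_partitions):
    # Send every (key, value) pair to the partition chosen by its key.
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


users = [("u1", "Ana"), ("u2", "Ben"), ("u3", "Caro")]
orders = [("u1", "book"), ("u3", "lamp"), ("u1", "pen")]

user_parts = repartition_by_key(users, 4)
order_parts = repartition_by_key(orders, 4)

# Co-located join: partition i of users only ever needs partition i of orders.
joined = []
for user_part, order_part in zip(user_parts, order_parts):
    names = dict(user_part)
    joined.extend(
        (key, (names[key], item)) for key, item in order_part if key in names
    )
```

No record has to be looked up outside its own partition pair, which is the property a partition-aware join relies on to avoid moving data a second time.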
5. Performance considerations when combining results in Spark
When combining results in Spark, some performance considerations need to be kept in mind. They help keep the combination efficient so that it does not hurt the execution time of the application. Here are some recommendations for improving performance when combining results in Spark:
1. Limit shuffle operations: Shuffle operations, such as groupByKey, can be expensive because they transfer data between cluster nodes. To reduce this cost, prefer aggregation operations such as reduceByKey, or a DataFrame groupBy followed by an aggregation. These still shuffle data, but they combine values within each partition first, so far less data moves across the network.
2. Cache intermediate data: When combining results in Spark, intermediate data may be produced that is used several times. To improve performance, use cache() or persist() to keep that intermediate data in memory. This avoids recomputing it every time a later operation uses it.
3. Take advantage of parallelism: Spark is known for its parallel processing capability, which lets tasks run in parallel across multiple nodes in a cluster. When combining results, it is important to make use of this. Operations such as mapPartitions or flatMap are useful here, because they let the data in each RDD partition be processed in parallel.
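The first recommendation can be made concrete with a small plain-Python model that counts how many records would cross the network with and without a local combine step. The partition contents are made up, and the functions only imitate the behavior of groupByKey and reduceByKey; they are not Spark code.

```python
from collections import Counter

# Hypothetical word-count style records, already spread over two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]

# groupByKey-style: every single record is sent through the shuffle.
shuffled_without_combine = [rec for part in partitions for rec in part]


def combine_locally(part):
    # Map-side combine, as reduceByKey does before the shuffle:
    # at most one record per key leaves each partition.
    local = Counter()
    for key, value in part:
        local[key] += value
    return list(local.items())


shuffled_with_combine = [
    rec for part in partitions for rec in combine_locally(part)
]


def final_counts(records):
    # Reduce side: merge whatever arrived into the final per-key totals.
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)


totals_a = final_counts(shuffled_without_combine)
totals_b = final_counts(shuffled_with_combine)
```

Both paths give the same totals, but the combined path ships four records instead of seven; with many repeated keys per partition the saving grows accordingly.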
6. Optimizing the combination of results in Spark
This is a key aspect of improving the performance and efficiency of our applications. In Spark, when we run operations such as filters, maps, or aggregations, intermediate results are held in memory or on disk before they are combined. However, depending on the configuration and the size of the data, this combination step can be costly in time and resources.
To optimize this combination, Spark relies on techniques such as data partitioning and parallel execution. Data partitioning means splitting the data into smaller chunks and distributing them across different nodes to make the most of the available resources. Each node can then process its own chunk of data independently and in parallel, which reduces execution time.
Another important aspect is parallel execution, in which Spark divides the work into tasks and runs them at the same time on different nodes. This makes efficient use of processing resources and speeds up the combination of results. In addition, the number of tasks follows the number of partitions, and with adaptive query execution enabled Spark SQL can adjust the number of shuffle partitions at runtime based on data size, which helps balance efficiency against overhead. These optimization techniques contribute to a considerable improvement in the response time of our Spark applications.
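The partition-then-combine pattern can be sketched with the Python standard library. Threads stand in for cluster nodes here purely for illustration, and the `split` helper is hypothetical; in Spark the partitions would live on different executors.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_partitions):
    # Round-robin split into roughly equal partitions.
    return [data[i::num_partitions] for i in range(num_partitions)]


def partial_sum(part):
    # The work done independently on each partition.
    return sum(part)


data = list(range(1, 101))
parts = split(data, 4)

# Process every partition concurrently, then combine the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, parts))

total = sum(partials)
```

Only four small partial results have to be brought together at the end, regardless of how many elements each partition held.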
7. Recommendations to avoid conflicts when combining results in Spark
1. Use the appropriate combination methods: When combining results in Spark, it is important to use the right methods to avoid conflicts and get accurate results. Spark provides several combination methods, such as join, union and merge, among others. You need to understand the differences between them and choose the one best suited to the task at hand. It is also worth getting familiar with the parameters and options of each method, since they can affect both performance and the accuracy of the results.
2. Clean the data thoroughly: Before combining results in Spark, it is essential to clean the data thoroughly. This means removing null values, duplicates and outliers, and resolving inconsistencies and discrepancies. Proper data cleaning protects the integrity and consistency of the combined results. In addition, data quality checks should be run to detect possible errors before the combination is performed.
3. Choose the right partitioning: How data is partitioned in Spark has a significant effect on the performance of join operations. It is advisable to optimize the partitioning before combining results, splitting the datasets evenly and in a balanced way to get the best efficiency. Spark offers several partitioning options, such as repartition and partitionBy, which can be used to distribute data well. Choosing the right partitioning avoids bottlenecks and improves the overall performance of the combination.
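The cleaning step from the second recommendation can be sketched in plain Python as follows. The `clean` function and the sample rows are hypothetical; on a Spark DataFrame the comparable calls are `dropna()` and `dropDuplicates()`.

```python
def clean(rows):
    """Drop rows with a missing key or value, then drop exact duplicates."""
    seen = set()
    result = []
    for key, value in rows:
        if key is None or value is None:
            continue  # incomplete row: cannot be joined reliably
        if (key, value) in seen:
            continue  # exact duplicate: would inflate the joined result
        seen.add((key, value))
        result.append((key, value))
    return result


raw = [(1, "Ana"), (None, "ghost"), (1, "Ana"), (2, None), (3, "Caro")]
cleaned = clean(raw)
```

Cleaning before the join matters because every duplicate key on one side multiplies the matching rows on the other side.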
I'm Sebastián Vidal, a computer engineer who is passionate about technology and DIY. I'm also the creator of tecnobits.com, where I share tutorials to make technology more accessible and understandable for everyone.