A Data Science Course Project About Crop Yield and Price Prediction I’m Still Not Ashamed Of | by Mikhail Sarafanov

Hello, dear reader! During these Christmas holidays, I experienced a feeling of nostalgia for my past student years. That’s why I decided to write a post about a student project that was done almost four years ago as a project for the course “Methods and Models for Multivariate Data Analysis” during my Master’s degree at ITMO University.

Disclaimer: I decided to write this post for two reasons:

– to share an approach to organizing university studies that has proven to be very effective (at least for me);
– to encourage people who are just starting to study programming and/or statistics to experiment with their pet or course projects, because sometimes such projects are memorable for many years and surprisingly satisfying.

The article mentions, in the form of tips, good practices that I was able to apply during the course project.

Beginning of the story

So, at the beginning of the course, we were informed that students could form teams of 2–3 people on their own and propose a course project that they would present at the end of the course. During the learning process (about 5 months), we would make intermediate presentations to our lecturers. This way, the professors could see how the progress was (or wasn’t) going.

After that, I immediately teamed up with my dudes, Egor and Camilo (just because we knew how to have fun together), and we started thinking about the topic…

Choosing the topic

I suggested choosing

– a theme that was big enough that we could work independently on different parts of it;
– a domain that was close to our interests (geographic information analysis for me and economics for my colleagues).

So, it was…

Figure 1. Chosen topic (image by author)

Camilo also wanted to try making dashboards with visualisations (using PowerBI), but pretty much any task would be suitable for that.

Tip 1: Choose a topic that you (and your colleagues) will be passionate about. It may not be the coolest project, and the topic may not be very popular, but you will enjoy spending your “evenings” working on it.

What was the main idea

The course consisted of a large number of topics, each of which is a set of methods for statistical analysis. We decided that we would try to forecast yield and crop price in as many different ways as possible and then ensemble the forecasts using some statistical method. This allowed us to try most of the methods discussed in the course in practice.

Also, the spatio-temporal data was truly multidimensional, which related pretty well to the main theme of the course.

Spoiler: we all got a score of 5 out of 5.

Research (& data sources)

We started with a literature review to understand exactly how crop yield and crop price are predicted. We also wanted to understand what kind of forecast error could be considered satisfactory.

I will not cite in this post the thesis resulting from this review. I will simply mention that we decided to use the following metric and threshold to evaluate the quality of the solution (for both crop yield and crop price):

Acceptable performance: the mean absolute percentage error (MAPE) for a reasonably good forecast should not exceed 10%
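To make the threshold concrete, here is a minimal sketch of how MAPE can be computed in plain Python. The function name and the yield numbers are made up for illustration; they are not taken from the project.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, in percent."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the observed value,
    # so observed values must be non-zero.
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up wheat yields in tonnes per hectare, for illustration only.
observed = [5.0, 4.0, 8.0]
predicted = [4.5, 4.4, 8.0]

score = mape(observed, predicted)
is_acceptable = score <= 10.0  # the 10% threshold chosen in the project
print(f"MAPE = {score:.2f}%, acceptable: {is_acceptable}")
```

Because every error is divided by the observed value, the metric is comparable across crops and countries with very different yield levels, which is probably why it suits a project like this one.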

Tip 2: Start your project (no matter whether at work or during your studies) with a review of contemporary solutions. Maybe the problem you are looking at now has already been solved.

Tip 3: Before starting development, determine what metric you will use to evaluate the solution. Remember, you can’t improve what you can’t measure.

Going back to the research, we identified the following data sources (links are up to date as of 28 December 2023):

Why these sources? We assumed that the price of a crop depends on the amount of product produced. And in agriculture, the quantity produced depends on weather conditions.

The model was implemented for:

– Yield of wheat, rice, maize, and barley;
– Countries: Germany, France, Italy, Romania, Poland, Austria, the Netherlands, Switzerland, Spain and the Czech Republic.

Climate data preprocessing

So, we started with an assumption: “Wheat, rice, maize, and barley yields depend on weather conditions in the first half of the year (until 30 June)” (Figure 2)

Figure 2. Generation of features for yield prediction (using the sum of active temperatures as an example) (image by author)

The source archives obtained from the European Space Agency website contain netCDF files. The files have daily fields for the following parameters:

– Mean daily air temperature, ℃
– Minimum daily air temperature, ℃
– Maximum daily air temperature, ℃
– Pressure, hPa
– Precipitation, mm

Based on the initial fields, the following parameters for the first half of each year were calculated:

– Total rainfall for the first half of the year, mm (see Animation);
– The number of days with precipitation for the first half of the year, days;
– Average pressure, hPa;
– Maximum average daily air temperature for the first six months, ℃;
– Minimum average daily temperature for the first six months, ℃;
– The sum of active temperatures above 10 degrees Celsius, ℃ (see Figure 3).
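The feature list above can be sketched for a single pixel as follows. This is only an illustration under stated assumptions: the function name, the dictionary keys, and the rule that a "day with precipitation" is any day with more than 0 mm are hypothetical, and the sum of active temperatures is assumed to be the sum of daily mean temperatures on days warmer than 10 ℃. The real pipeline worked on netCDF grids rather than Python lists.

```python
from datetime import date


def half_year_features(year, mean_temp, precip, base_temp=10.0, wet_day_mm=0.0):
    """Aggregate daily values for one pixel over 1 January to 30 June.

    mean_temp and precip are daily series that start on 1 January of `year`.
    """
    n_days = (date(year, 7, 1) - date(year, 1, 1)).days  # 181, or 182 in leap years
    temps = mean_temp[:n_days]
    rain = precip[:n_days]
    return {
        "total_rainfall_mm": sum(rain),
        "rainy_days": sum(1 for r in rain if r > wet_day_mm),
        "max_mean_temp": max(temps),
        "min_mean_temp": min(temps),
        # Assumed definition: add up daily means on days warmer than base_temp.
        "active_temp_sum": sum(t for t in temps if t > base_temp),
    }


# Synthetic year: 100 cold days, then warm days; rain on every tenth day.
temps_1950 = [5.0] * 100 + [15.0] * 81 + [20.0] * 184
precip_1950 = [2.0 if day % 10 == 0 else 0.0 for day in range(365)]

features = half_year_features(1950, temps_1950, precip_1950)
print(features)
```

Days after 30 June are cut off before any aggregation, so the 20 ℃ summer days in the synthetic series do not contribute to any feature.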

Animation. Rainfall for the first half of the year, mm. The year is shown at the top of the chart (by author)

Figure 3. Sum of active temperatures for the first half of the year 1950, °C (image by author)

Thus we obtained matrices for the whole territory of Europe with calculated features for the future model(s). The reader may notice that I calculate a parameter called “the sum of active temperatures above 10 degrees Celsius”. It is a popular parameter in ecology and botany that helps to determine the temperature optimums for different species of organisms (mainly plants; see, for example, “The sum of active temperatures as a method of determining the optimum harvest date of ‘Šampion’ and ‘Ligol’ apple cultivars”).

Tip 4: If you have expertise in the domain (not related to Data Science), make sure you use it in the project: show that you are not only doing a “fit-predict” but also adapting and improving domain-specific approaches.

The next step is the aggregation of information by country. Values from the meteorological parameter matrices were extracted for each country separately (Figure 4).

Figure 4. Matrix with the boundaries of countries (image by author)

I would note that this strategy made sense (Figure 5). For example, the picture shows that for Spain, wheat yields are almost unaffected by the sum of active temperatures. However, for the Czech Republic, a hotter first half of the year is more likely to result in lower yields. It is therefore a good idea to model yields separately for each country.

Figure 5. Dependence of wheat yield (tonnes per hectare) on the sum of active temperatures (image by author)

Not all of a country’s territory is suitable for agriculture. Therefore, it was necessary to aggregate information only from certain pixels. To account for the location of agricultural land, the following matrix was prepared (Figure 6).

Figure 6. Land use matrix (image by author)
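The aggregation over country and land use masks could look roughly like the sketch below. The grids, the country codes, and the choice of a plain mean as the aggregate are assumptions made for illustration; the post does not say which statistic was used.

```python
def country_mean(feature, country_ids, cropland, country_code):
    """Average a feature grid over the cropland pixels of one country."""
    values = [
        value
        for f_row, c_row, l_row in zip(feature, country_ids, cropland)
        for value, cid, is_crop in zip(f_row, c_row, l_row)
        if cid == country_code and is_crop
    ]
    if not values:
        raise ValueError("no cropland pixels found for this country")
    return sum(values) / len(values)


# Tiny 2 x 3 grids with made-up values.
active_temp_sum = [[1000.0, 1200.0, 900.0],
                   [1100.0, 1300.0, 800.0]]
country_ids = [[1, 1, 2],
               [1, 2, 2]]          # hypothetical country codes
cropland = [[True, False, True],
            [True, True, False]]   # True where the land is agricultural

mean_country_1 = country_mean(active_temp_sum, country_ids, cropland, 1)
mean_country_2 = country_mean(active_temp_sum, country_ids, cropland, 2)
print(mean_country_1, mean_country_2)
```

Each country ends up with one value per feature per year, which is the tabular form the models in the following sections need.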

1. The topic of the lecture is: univariate statistical testing

So, we’ve got the data ready. However, agriculture is a very complex industry that has improved markedly year by year, decade by decade. It may make sense to limit the training sample for the model. For this purpose, we used the cumulative sum method (Figure 7):

Cumulative sum method: each number in the sample is replaced by the running total of all numbers up to and including it. That is, if the sample includes only three years (1950, 1951, and 1952), the value for 1950 is plotted on the Y-axis for 1950, 1951 shows the sum of 1950 and 1951, and so on.

– If the shape of the line is close to a straight line and there are no fractures, the sample is homogeneous

– If the line has a fracture, the sample is divided into 2 parts at this fracture

Figure 7. France. Comparison of target variable across years: Wheat (tonnes per hectare) (image by author)

If a fracture was detected, we tested whether the two samples belong to the same general population (Kolmogorov-Smirnov statistic). If the samples were statistically significantly different, we used the second part to train the model for prediction. If not, we used the entire sample.
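The procedure above can be sketched in plain Python. Several things here are assumptions: the position of the fracture (`break_index`) is taken as given, for example read off the cumulative plot by eye, since the post does not say how it was located; the function names are hypothetical; and the decision uses the large-sample critical value of the two-sample Kolmogorov-Smirnov test, which is only a rough guide for short series. In practice `scipy.stats.ks_2samp` returns a proper p-value.

```python
import math
from itertools import accumulate


def cumulative_curve(values):
    """Running total that is plotted against the year."""
    return list(accumulate(values))


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


def choose_training_sample(values, break_index, alpha=0.05):
    """Keep only the part after the fracture if the two parts differ."""
    first, second = values[:break_index], values[break_index:]
    d = ks_statistic(first, second)
    n, m = len(first), len(second)
    # Large-sample approximation of the critical value at level alpha.
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return second if d > critical else values


# Made-up yields with an obvious jump after the sixth year.
shifted = [2.0, 2.1, 1.9, 2.2, 2.0, 2.1, 5.0, 5.2, 4.9, 5.1, 5.3, 5.0]
# Made-up yields without any change in level.
stable = [2.0, 2.1, 1.9, 2.2, 2.0, 2.1, 2.1, 2.0, 2.2, 1.9, 2.1, 2.0]

print(cumulative_curve(shifted))
print(choose_training_sample(shifted, 6))  # only the second part
print(choose_training_sample(stable, 6))   # the whole sample
```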

Tip 5: Don’t be afraid to combine approaches to statistical analysis (it’s a course project!). For example, in the lecture we were not told about the cumulative sums method; the topic was about comparing distributions. However, I had previously used this approach to compare trends in ice conditions during the processing of ice maps. It seemed to me that it could be useful here as well.

I should note here that we assumed that the process is ergodic, so we decided to compare in this way.

So, after the preparation, we are ready to start building statistical models. Let’s take a look at the most interesting part!

2. The topic of the lecture is: multivariate regression

The following features were included in the model:

– Total rainfall;
– The number of days with precipitation;
– The sum of active temperatures above 10 ℃;
– Mean pressure;
– Minimum air temperature, ℃.

Target variables: Yield of wheat, rice, maize, and barley

Validation years: 2008–2018 for each country

Let’s move on to the visualisations to make it a little clearer.

Figure 8. Generated surfaces with predictions for each country for a specific year based on initialized models (image by author)

And here is Figure 9, showing the residuals (residual = observed value - estimated (predicted) value) from the linear model for France and Italy:

Figure 9. Visualisation of residuals and metrics for linear regression on the validation sample (image by author)

It can be seen from the graphs that the metric is satisfactory, but the error distribution is shifted away from zero, which means that the model has a systematic error. We tried to correct this in the models below.

Validation sample MAPE metric value: 10.42%
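A quick way to see the kind of systematic error mentioned above is to look at the mean of the residuals: for an unbiased model it should be close to zero. The sketch below uses the residual definition from the text; the function name and the numbers are made up for illustration.

```python
def residual_summary(observed, predicted):
    """Return the residuals (observed - predicted) and their mean."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_residual = sum(residuals) / len(residuals)
    return residuals, mean_residual


# Made-up validation yields in tonnes per hectare.
observed = [6.0, 6.5, 7.0, 6.8]
predicted = [5.5, 6.0, 6.6, 6.3]

residuals, mean_residual = residual_summary(observed, predicted)
# All residuals are positive here, so this model consistently underestimates.
print(residuals, mean_residual)
```

A mean residual far from zero, with most residuals sharing one sign, suggests the model is missing a trend or a level shift rather than just being noisy.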

Tip 6: Start with the simplest models (e.g. linear regression). This will give you a baseline against which you can compare improved versions of the model. The simpler the model, the better, as long as it shows a satisfactory metric.

3. The topic of the lecture is: multivariate distributions analysis

We turned the material from this lecture into a model called “Distribution analysis”. The concept was simple: we analysed the distributions of climatic parameters for each past year and for the current year, found an analogue year for the current one, and predicted the yield to be exactly the same as the known yield of that past year (Figure 10).

Figure 10. The concept of pairwise comparison for the purpose of selecting an analogue year (image by author)

Idea: Yields for years with similar weather conditions will be similar

The approach: Pairwise comparison of temperature, precipitation, and pressure distributions. The prediction is the yield for the year that is most similar to the one considered

Distributions used:

– Temperature for the first half of the year, temperature for the months: February, April, June;
– Precipitation for the first half of the year, precipitation for the months: February, April, June;
– Pressure for the first half of the year, pressure for the months: February, April, June.

To compare the distributions, we used the Kruskal-Wallis test. To adjust the p-values, a multiple testing correction was introduced: the Bonferroni correction.

Figure 11. Obtained distributions for the years 2000 and 2018 for air temperature (image by author)

Validation sample MAPE metric value: 13.80%
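The Bonferroni correction itself is a one-liner: each p-value is multiplied by the number of tests (capped at 1) before it is compared with the significance level. The sketch below assumes the p-values have already been produced by the individual tests (for example by `scipy.stats.kruskal`); the function name and the numbers are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and the rejection decisions."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    rejected = [p_adj < alpha for p_adj in adjusted]
    return adjusted, rejected


# Made-up p-values from four pairwise comparisons of two years.
p_values = [0.004, 0.03, 0.2, 0.5]
adjusted, rejected = bonferroni(p_values)
print(adjusted, rejected)
```

Note that the second p-value (0.03) would count as significant on its own but no longer does after the correction. With many distributions compared for every pair of years, this is what keeps spurious differences from piling up.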

Tip 7: If you are doing multiple statistical testing, don’t forget to include a correction (for example, the Bonferroni correction)

4. The topic of the lecture is: Bayesian network

One of the lectures was focused on Bayesian networks. Therefore, we decided to adapt the approach for yield prediction. We assumed that each year is described by a set of groups of variables A, B, C, etc., where A is a set of categories describing crop yields, B is, for example, the sum of active temperatures conditions, and so on. A, for example, could take only three values: “High crop yield”, “Medium crop yield”, “Low crop yield”. The same goes for B, C and the others. Thus, if we categorise the conditions and the target variable, we obtain the following description of each year:

– 1950: “High heat supply”, “Low rainfall supply”, “High atmospheric pressure” → “High crop yield”
– 1951: “Low heat supply”, “High rainfall supply”, “High atmospheric pressure” → “Medium crop yield”
– 1952: “Low heat supply”, “Low rainfall supply”, “High atmospheric pressure” → Which crop yield?

The algorithm was designed to predict a yield category based on a combination of three other categories:

– Crop yield (3 categories): hidden state, target variable
– Sum of active temperatures (3 categories)
– Rainfall (3 categories)
– Mean pressure (3 categories)

How can we define these categories? By using a clustering algorithm! For example, the following 3 clusters were identified for wheat yields

Figure 12. Wheat yield clusters for Bayesian network analysis (image by author)

The final forecast of this model is the average yield of the predicted cluster.

Validation sample MAPE metric value: 14.55%
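The post does not describe the structure of the network, so the sketch below swaps in a naive Bayes classifier, which is the simplest Bayesian network (the yield category is the only parent of every condition). It is meant only to illustrate the "predict a category, then return the average yield of that category" idea. All names, categories, and yields are made up, and the categories are given directly instead of coming from a clustering step.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, targets):
    """Count how often each condition value occurs within each yield class."""
    class_counts = Counter(targets)
    feature_counts = defaultdict(Counter)  # key: (feature index, class)
    for row, target in zip(rows, targets):
        for i, value in enumerate(row):
            feature_counts[(i, target)][value] += 1
    return class_counts, feature_counts


def predict_category(model, row, n_levels=3):
    """Return the yield class with the highest (unnormalised) posterior."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for cls, count in class_counts.items():
        score = count / total  # prior probability of the class
        for i, value in enumerate(row):
            # Laplace smoothing keeps unseen combinations from zeroing the score.
            score *= (feature_counts[(i, cls)][value] + 1) / (count + n_levels)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class


# Made-up history: (heat supply, rainfall supply, pressure) per year.
rows = [("High", "Low", "High"), ("Low", "High", "High"),
        ("High", "Low", "Medium"), ("Low", "Low", "Low"),
        ("Low", "High", "Medium")]
targets = ["High", "Medium", "High", "Low", "Medium"]  # yield categories
yields = [7.0, 6.0, 7.4, 4.8, 6.2]                     # tonnes per hectare

model = fit_naive_bayes(rows, targets)
predicted = predict_category(model, ("High", "Low", "Low"))

# The forecast is the average yield of the years in the predicted category.
cluster = [y for y, t in zip(yields, targets) if t == predicted]
forecast = sum(cluster) / len(cluster)
print(predicted, forecast)
```

Because the forecast can only take as many values as there are yield categories, a coarse error is to be expected from this kind of model.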

Tip 8: Do experiment! Bayesian networks with clustering for time series forecasting? Sure! Pairwise analysis of distributions? Why not? Sometimes the boldest approaches lead to significant improvements

5. The topic of the lecture is: Time series forecasting

Of course, we can forecast the target variable as a time series. Our task here was to understand how classical forecasting methods work in theory and practice.

Putting this method into practice proved to be the easiest. For example, in Python there are several libraries that allow you to customize and apply the ARIMA model, for example pmdarima.

Figure 13. Application of the ARIMA model to predict a yield time series. X-axis: time index, Y-axis: barley yield (image by author)

Validation sample MAPE metric value: 10.41%

Tip 9: Don’t neglect the comparison with classical approaches. An abstract metric will not tell your colleague much about how good your model is, but a comparison with well-known standards will show the true level of performance

6. The topic of the lecture is: Ensembling

After all the models had been built, we explored exactly how each model is “wrong” (remember the residual plots for the linear regression model, see Figure 9):

Figure 14. Residuals plot for different crop yield forecasting models (image by author)

None of the presented algorithms managed to beat the 10% threshold (according to MAPE).

The Kalman filter was used to improve the quality of the forecast (to ensemble it). Satisfactory results were achieved for some countries (Figure 15)

Figure 15. Crop yield predictions for different countries using the ensemble (image by author)

Validation sample MAPE metric value: 9.89%
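The post does not explain how the Kalman filter was configured, so the following is only one possible reading: the true yield of a given year is treated as a fixed hidden state, and each model’s forecast is treated as a noisy measurement of it. The error variances would have to be estimated, for example from each model’s validation residuals; here they are simply made-up numbers, and the function name is hypothetical.

```python
def kalman_fuse(forecasts, variances):
    """Combine forecasts with sequential scalar Kalman measurement updates."""
    # Start from the first forecast and its error variance.
    estimate, variance = forecasts[0], variances[0]
    for z, r in zip(forecasts[1:], variances[1:]):
        gain = variance / (variance + r)       # Kalman gain
        estimate = estimate + gain * (z - estimate)
        variance = (1.0 - gain) * variance     # uncertainty shrinks
    return estimate, variance


# Made-up forecasts (tonnes per hectare) from three models and
# made-up error variances for each of them.
forecasts = [6.0, 7.0, 6.8]
variances = [0.25, 0.25, 0.5]

ensemble, ensemble_variance = kalman_fuse(forecasts, variances)
print(ensemble, ensemble_variance)
```

With a static state, this reduces to an inverse-variance weighted average: models that were more accurate on validation get more weight, and the combined variance is smaller than that of any single model.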

Tip 10: If I were asked to integrate the developed model into a production service, I would integrate either ARIMA or linear regression, even though the ensemble metric is better. Metrics in business problems are sometimes not the key. A standalone model is sometimes better than an ensemble because it is simpler and more reliable (even if the error metric is slightly higher)

Futures price prediction

And the final part: a model (lasso regression) that used predicted yield values and Futures features to estimate possible price values (Figure 16):

Figure 16. Wheat Futures prices prediction (image by author)

MAPE on validation sample: 6.61%

Why I still think this project is a good one

So that’s the end of the story. Several tips were posted above. In this last section, I want to summarise the final point and say why I’m pleased with that project. Here are three main items:

– Organisation of work and choice of topic: we combined our strengths and best qualities very well, planned the work stack well and managed as a team to prepare a good project and deliver it on time. So, I improved my soft skills;
– Meaningful theme: I was passionate about what we were doing. Even if I had a few weeks free now for a pet project, I would happily apply my current skills and experience to such a case study again. So, I was pleased with the work we had done;
– Hard skills: During our work we tried new statistical methods, improved our understanding of already familiar ones, and enhanced our programming skills.

Well, we also got great marks on the exam XD

I hope your projects at university and elsewhere will be as exciting for you. Happy New Year!

Sincerely yours, Mikhail Sarafanov
