Unifying Inconsistent Evaluation Metrics in Recommender Systems

Maliheh Izadi

Amin Javari

Mahdi Jalili

Department of Computer Engineering Sharif University of Technology [email protected]

Department of Computer Engineering Sharif University of Technology [email protected]

Department of Computer Engineering Sharif University of Technology School of Electrical and Computer Engineering, RMIT University, Melbourne, Australia [email protected]

ABSTRACT
Recommender systems are among the most popular tools used by online communities these days. Traditionally, recommender techniques were evaluated using accuracy-based metrics such as precision; however, the need for other qualities, including more novel and diverse items, gradually emerged. Consequently, researchers started to evaluate their findings with different and often inconsistent metrics, making it nearly impossible to compare the existing approaches properly. It is clear that we need a more unified approach to assess the results of new techniques, and to the best of our knowledge, this problem has not been addressed in previous studies. In this paper, we propose a novel and extensible framework for the evaluation of recommender systems using the maximum possible bounds of the measures in different datasets. Finally, we provide the results of applying this framework to a set of different recommender algorithms.

Keywords: Recommender Systems, Evaluation Metrics, Precision, Diversity, Shannon Entropy, Novelty.

Copyright is held by the author/owner(s). Workshop on 'Recommender Systems Evaluation: Dimensions and Design' (REDD 2014), held in conjunction with RecSys 2014. October 10, 2014, Silicon Valley, United States.

1. INTRODUCTION
In recent years, due to the ever-increasing volume of available data on the internet, users encounter a wide range of options and often cannot find what they are interested in within a reasonable amount of time. Therefore, recommender systems have emerged to help people find what they are looking for more easily [1]. The goal of a recommender system (RS) is to predict how much a user would like an item according to her profile, history of activities and other available information about users, items or the system itself, and then provide her with a list of items which she will probably enjoy. To achieve this goal, a recommender algorithm can utilize various information sources, including ratings given to items, users' demographic and social information, item content, features and tags, and finally contextual information such as time, location or the user's mood [2]. As expected, as the number of RSs grows, the need to evaluate them properly increases as well. Evaluation of RSs can be conducted offline or online [3, 4]. In an offline analysis, the dataset has already been collected; a proportion of the ratings is hidden from the RS as a test set, and the rest is given to the RS as input for predicting the ratings of unseen items. The results are then evaluated using one or more metrics. Offline analysis has the advantage of being quick; however, the true satisfaction of users cannot be measured with this approach. On the other hand, online evaluations are conducted in a live experiment by observing users' behaviors, recommending items, and measuring user satisfaction through feedback or by tracking actions such as click-through rate. However, conducting an online evaluation is expensive and hard to deploy. Therefore, in this paper, we use the offline approach to evaluate different RSs.

As for the evaluation metrics, there are plenty of metrics which assess the system from different points of view. The first group of these metrics are accuracy measures, such as MAE and RMSE, which calculate the average deviation of the predicted ratings from the real ones, and classification accuracy metrics, e.g. precision, recall and F1 score, which calculate how many of the recommended items are liked by users and how many of the desired items get recommended to the users. Accuracy metrics have been the most commonly used measures for the evaluation of RSs over the last two decades [3, 5-7]; however, they are not sufficient to evaluate a RS thoroughly. For instance, consider a RS which recommends popular items to users: the precision of such a system is high, as expected, since most people will likely enjoy the popular items it suggests. Yet its novelty is low, since the recommended items are mostly already known or even used by the users. Moreover, its diversity is low as well, because a limited number of popular items are recommended to everybody. On the other hand, consider a random recommender which suggests items randomly to users: the precision of random recommendations will be low, but their diversity and novelty are high due to the randomness of this method. The point is that, contrary to information retrieval systems in which the precision of results is highly valued, in a recommender system other factors such as diversity and novelty are important as well. It is clear that using such a system can increase sellers' profit while also gaining users' loyalty. Therefore, in the past years researchers have proposed and used several metrics other than accuracy for measuring the performance of recommenders; among them, one can name diversity, coverage and novelty measures.

The inconsistency arises when researchers choose different evaluation metrics to measure their findings. As shown by different studies [8-13], there is a tradeoff between the values of diversity, novelty and accuracy measures, i.e., as the accuracy of a recommender increases, the novelty and diversity of its recommendations decrease, and vice versa (note that here we use popularity-based novelty and an entropy-based approach to diversity and coverage, which are described in more detail in section 2.1). Moreover, most of the studies that have used both diversity and precision (or precision and novelty) to evaluate their findings have treated these measures as individual entities. For instance, when comparing different RSs, the precision of one recommender may be higher than that of another, while its diversity is usually lower. Indeed, it is generally unlikely to find two RSs in which both the precision and the diversity of the first are higher than those of the second at the same time. Even when comparing a random recommender with a well-accepted RS such as a matrix factorization based recommender, although the precision of the random recommender is very low, its diversity and novelty can be much higher, and one cannot easily choose the best approach considering these three factors. We aim to address this problem in the field of recommender systems. To this end, we introduce a novel evaluation framework which integrates these metrics and measures the performance of recommender algorithms more properly. In order to understand the dependency between these evaluation metrics, we fix the value of precision and compute the maximum possible bounds of diversity and novelty for this value in a given dataset. We then analyze how close the obtained values of diversity and novelty are to their maximum bounds. It is worth mentioning that our main focus is to provide additional information in order to better evaluate recommenders. In the end, we suggest an approach for normalizing and combining these measures so that different algorithms can be compared more easily.

2. METHOD
In the first part of this section, we provide a review of the evaluation metrics precision, diversity/coverage and novelty in the field of RSs. Afterward, we present the proposed approach for computing the maximum values of diversity and novelty at a specified precision, and finally suggest an integrated metric for considering these measures at the same time.

2.1 Evaluation metrics
Recommenders can be evaluated using different metrics. We have chosen the three measures of precision, diversity and novelty, which are reviewed in the following subsections. These metrics are selected since each of them represents a different dimension of RS evaluation. For example, precision is a good representative of the accuracy and relevance-based measures. Moreover, novelty is a suitable candidate among the metrics which measure unexpectedness and serendipity. Finally, entropy-based coverage captures the notion of both diversity among the recommendation lists and coverage of the available items in a system, as well as indicating the concentration of recommendations. We hope to add more metrics to this framework in an extended version of this work.

2.1.1 Precision
Classification accuracy metrics such as precision have long been used in information retrieval systems to determine how well a system can label a relevant item as a good one and an irrelevant item as a bad one. However, there is an issue when applying these metrics for the evaluation of recommenders, namely how to interpret items that have not yet been rated by the user [3]. We cannot assume such items are irrelevant for the user, because she might simply not have seen or known about them yet. Considering this, we predict ratings for the unobserved items and evaluate the top-k items in the list using the ratings she has already given (this is called precision at k). To compute the precision, we use a confusion matrix as shown in Table 1, which divides the items into four groups: true positives (relevant items recommended by the system), false positives (irrelevant items recommended by the system), true negatives (irrelevant items not recommended by the system), and finally false negatives (relevant items not recommended by the system).

Table 1. Confusion matrix. TP represents true positives, TN true negatives, FP false positives and FN false negatives.

           | Recommended | Not recommended
Relevant   | TP          | FN
Irrelevant | FP          | TN

Precision is computed as the probability of recommending relevant items [14],

\[ P = \frac{TP}{TP + FP}. \tag{1} \]

Therefore, P@k is calculated as

\[ P@k = \frac{TP}{k}, \tag{2} \]

where k is the size of the recommendation list [3]. For simplicity, we denote P@k by Pk from now on.
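As a concrete illustration, the following minimal Python sketch computes P@k from top-k recommendation lists and the set of relevant (test-set) items per user; the function and variable names are illustrative, not taken from the original implementation.

```python
def precision_at_k(recommendations, relevant, k):
    """Average P@k over users.

    recommendations: dict mapping user -> ordered list of recommended item ids
    relevant: dict mapping user -> set of item ids the user liked in the test set
    """
    precisions = []
    for user, rec_list in recommendations.items():
        top_k = rec_list[:k]
        hits = sum(1 for item in top_k if item in relevant.get(user, set()))  # TP in the top-k list
        precisions.append(hits / k)
    return sum(precisions) / len(precisions)

# Example: two users, k = 3
recs = {"u1": ["i1", "i2", "i4"], "u2": ["i3", "i5", "i1"]}
likes = {"u1": {"i1", "i5"}, "u2": {"i3", "i1"}}
print(precision_at_k(recs, likes, k=3))  # (1/3 + 2/3) / 2 = 0.5
```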

2.1.2 Diversity and entropy-based coverage
Diversity answers questions such as what percentage of the available items in the system are recommended to users, and how similar the recommendation lists of different users are to each other. One can simply take the Hamming distance between two users' recommendation lists, which uses the intersection of the items suggested to users u and u' to compute the similarity between their top-k lists [15]. Eq. (3) calculates this distance, where Lu is the list of items recommended to user u, and |L| is the size of the recommendation list, which is equal to k,

\[ D_{u,u'} = 1 - \frac{|L_u \cap L_{u'}|}{|L|}. \tag{3} \]

In order to add the concept of item coverage to diversity, one can apply Shannon entropy. Entropy-based coverage measures what percentage of items are recommended and how evenly they are distributed over the users' recommendation lists, using

\[ EC = -\sum_{i=1}^{n} p_i \log_2 p_i, \tag{4} \]

where p_i represents the percentage of the recommendation lists that contain item i and n is the number of items. In this paper we use entropy-based coverage as the representative of diversity and coverage in RSs.
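The computation can be sketched in a few lines of Python. Here p_i is taken, following the text literally, to be the fraction of recommendation lists that contain item i; the names are illustrative only.

```python
import math
from collections import Counter

def entropy_based_coverage(recommendations):
    """Entropy-based coverage of a set of top-k recommendation lists.

    recommendations: dict mapping user -> list of recommended item ids
    """
    num_lists = len(recommendations)
    counts = Counter(item for rec_list in recommendations.values() for item in set(rec_list))
    ec = 0.0
    for item, count in counts.items():
        p_i = count / num_lists  # fraction of lists containing item i (assumed reading of eq. (4))
        ec -= p_i * math.log2(p_i)
    return ec

# Example
recs = {"u1": ["i1", "i2", "i3"], "u2": ["i1", "i4", "i5"]}
print(entropy_based_coverage(recs))
```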

2.1.3 Novelty
Finally, novelty assesses how informative the recommendations are. For example, if we suggest a popular item to a user, she is probably going to like it, but most likely she already knows about this product. Therefore, we need to bring new information to users from time to time. Novelty can be computed using the notion of popularity, since the two have an inverse relationship, i.e., as popularity grows, novelty decreases, as shown in eq. (5), where m is the number of users and d_i is the degree of item i (how many times it has been rated as good by users) [16],

\[ N_i = \log_2\!\left(\frac{m}{d_i}\right). \tag{5} \]
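A minimal Python sketch of this popularity-based novelty follows. As an assumption, the novelty of a set of recommendations is taken to be the average of N_i over the recommended items, which is one common convention but is not spelled out in the text.

```python
import math

def item_novelty(degree, num_users):
    """Novelty of a single item: log2(m / d_i)."""
    return math.log2(num_users / degree)

def average_novelty(recommendations, item_degrees, num_users):
    """Mean novelty over all recommended (user, item) pairs (assumed aggregation).

    recommendations: dict mapping user -> list of recommended item ids
    item_degrees: dict mapping item id -> number of users who rated it as good
    """
    values = [item_novelty(item_degrees[item], num_users)
              for rec_list in recommendations.values() for item in rec_list]
    return sum(values) / len(values)

# Example using degrees similar to Table 3 (items with degree 0 are excluded)
degrees = {"i1": 3, "i2": 5, "i3": 4, "i5": 5}
recs = {"u1": ["i1", "i2"], "u2": ["i3", "i5"]}
print(average_novelty(recs, degrees, num_users=8))
```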

So far, we have reviewed the evaluation metrics of precision, entropy-based coverage and novelty. We discuss the details of the proposed model to unify these metrics in the following section.

2.2 The proposed model for unified evaluation of RSs
It is clear that precision, novelty and entropy-based coverage each capture valuable aspects of a recommender; however, to evaluate a RS properly we need to consider these metrics as an integrated entity. Due to the existing inconsistency between these measures, reaching a balance among them is not a trivial task. For example, suppose we have two RSs, A and B, which we have evaluated using precision and novelty. If the results show that the precision of the first approach is better than that of the second, but its novelty is worse, how can we compare these methods? Let us assume that

\[ P_k(A) > P_k(B), \qquad N(A) < N(B). \tag{6} \]

Can we be sure that the novelty of the first RS is always lower than that of the second? In other words, is it possible for recommender A to achieve the novelty of recommender B by reducing its precision to Pk(B), or is there a bound? One possible solution would be to consider another evaluation metric, such as entropy-based coverage, along with the inequalities in eq. (6). Suppose we compute and compare the entropy-based coverage of the recommendations generated by recommenders A and B, and we have

\[ EC(A) > EC(B). \tag{7} \]

Can we select recommender A as the winner without any hesitation? What if the novelty provided by recommender B is much higher than the novelty of A? Would that make any difference?

\[ N(A) \ll N(B). \tag{8} \]

In order to solve this problem, we present a framework which enables us to compare recommenders based on a unified evaluation metric. Considering only two metrics first, we compute the recommender's precision and novelty, and then calculate the highest possible value of novelty that an ideal RS could reach while maintaining the computed precision. We do the same for entropy-based coverage, computing the maximum entropy-based coverage attainable in the system for the same specified precision. Note that the maximum values depend on the structure and number of available ratings in the test set and are independent of any recommender; in other words, the maximum values can be achieved using a perfect recommender algorithm. The next step is to divide the computed entropy-based coverage and novelty of the RS by the maximum entropy-based coverage and novelty values, respectively, to normalize them. As for precision, the precision of the recommender is divided by the maximum possible precision in the system. In the end, we have three normalized values of precision, entropy-based coverage and novelty, which can be combined using different methods to produce a final measure of a recommender's performance according to these three aspects. Further details about the maximization process are provided in the following section.

Our goal is to find the relationship between the precision, entropy-based coverage and novelty of a RS and present an integrated value representing the performance of the recommender. To this end, we break this triple of metrics into two pairs, precision/entropy-based coverage and precision/novelty, and maintain the relationship between them by fixing the precision. In other words, we want to turn a 2-dimensional problem into a 1-dimensional one; therefore, we fix one dimension (here, precision) and maximize the other one (entropy-based coverage or novelty). In order to fix the precision, we need the number of hits (correct recommendations, also known as true positives (TP)) in the recommendation lists of all users. Given the average precision of the recommender, P, shown in eq. (9), we can obtain the total number of hits using eq. (11),

\[ P = \frac{1}{m} \sum_{u=1}^{m} \frac{hits_u}{k}, \tag{9} \]

\[ P = \frac{hits}{k \cdot m}, \tag{10} \]

\[ hits = P \cdot k \cdot m, \tag{11} \]

where m is the number of users and k is the number of recommendations given to each user (for example, with P = 0.5, k = 3 and m = 8 we obtain hits = 0.5 × 3 × 8 = 12, which is the configuration used in the worked example below). The next step is to find the maximum values of novelty and entropy-based coverage with a fixed number of hits. The proposed method consists of two algorithms that maximize entropy-based coverage and novelty at a predefined precision, and a combination metric to integrate the results; these are presented in the following subsections.

2.2.1 Maximizing entropy-based coverage
In order to maximize entropy-based coverage, we need to present users with the most diverse items possible, which means we need to minimize the similarity between users' recommendation lists. Since this similarity increases as the number of common items in the recommendation lists grows, the best case is when the items in all users' lists are unique, which requires at least m×k items to recommend; this is not always possible, due to the ever-growing number of members in online systems and the finite number of available items. Thus, following Shannon entropy, our solution is to distribute all the available items in the system equally and with the minimum possible frequency among users. The maximization algorithm has two steps: first we fill in the hits needed to reach the specified precision, and then we continue distributing items equally without creating any more hits.

Suppose we have 12 hits in a system with 8 users and 5 items, where we recommend 3 items to each user. We want to generate 8×3 = 24 recommendations, of which 12 must be correct (precision is fixed at 0.5). Therefore, according to our hypothesis, we should recommend each item approximately 24/5 ≈ 5 times (the optimal number) so that each item has an equal chance of being recommended. First, we need to fill in the hits. Suppose the distribution of hits in the test set is as shown in Table 2; the items' degree distribution (hits) in the test set is then as shown in Table 3.

Table 2. Items liked by users according to the test set.

Users | Hits in the test set
u1    | i1, i2, i5
u2    | i5
u3    | i1, i2, i3, i5
u4    | i2, i3
u5    | i1, i3
u6    | i5
u7    | i2, i3
u8    | i2, i5

Table 3. Degree distribution of items in the test set.

Items  | i1 | i2 | i3 | i4 | i5
Degree | 3  | 5  | 4  | 0  | 5

One way to generate the 24 recommendations is to recommend i2, i3 and i5 to complete the hits and fill in the rest of the recommendation lists with un-hit (user, item) pairs, but this would not yield maximum entropy-based coverage. Our solution is to distribute the items as equally as we can. Therefore, we start to fill the 12 hits using the 5 items: the first item goes to a user who liked it in the test set, the second item goes to the same or another user who liked it, and continuing this procedure we complete the hits while distributing items as equally as possible. Here, we start with i1 and recommend it to u1, then i2 to u1 or u3 (it makes no difference), then i3 to u4, and so on until we reach twelve hits. Suppose that at the end of the first step the distribution of recommended items is as shown in Table 4.

Table 4. Degree distribution of recommended items in the first step.

Items  | i1 | i2 | i3 | i4 | i5
Degree | 3  | 3  | 3  | 0  | 3

In the second step, the rest of the recommendation lists (the remaining 12 recommendations) must be filled with items that have been recommended fewer times than the optimal number and that are not hits for those users (in order to keep the precision fixed). Thus, we start from the items with the lowest degrees (here i4) and continue to distribute items equally until each item is recommended (approximately) the optimal number of times (here 5). In the end, the items' degree distribution in the recommendation matrix is as shown in Table 5. This procedure distributes items as equally as possible between users; however, there is a controlling factor which affects how equally we can distribute items in the users' lists. As the precision grows, the number of hits increases, and as the number of hits grows, distributing items equally becomes stricter and sometimes impossible, resulting in a decrease of the maximum entropy-based coverage.

Table 5. Degree distribution of recommended items in the recommendation matrix.

Items  | i1 | i2 | i3 | i4 | i5
Degree | 5  | 5  | 5  | 4  | 5
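The greedy procedure described above can be sketched in Python as follows. This is a minimal illustration of the two-step idea (fill the required hits as evenly as possible, then top up the lists with non-hit items, always picking the least-recommended eligible item); it is an assumption-laden simplification, not the authors' implementation, and ties in item order are broken arbitrarily.

```python
def maximize_ec_lists(test_likes, items, k, hits_needed):
    """Greedy top-k lists with a fixed number of hits, spreading items as
    evenly as possible to push entropy-based coverage up.

    test_likes: dict user -> set of items the user liked in the test set
    items: list of all item ids
    """
    rec = {u: [] for u in test_likes}
    count = {i: 0 for i in items}          # how often each item has been recommended

    def place(item, user):
        rec[user].append(item)
        count[item] += 1

    # Step 1: fill in the hits, always using the currently least-recommended item
    placed_hits = 0
    while placed_hits < hits_needed:
        for item in sorted(items, key=lambda i: count[i]):
            user = next((u for u in test_likes
                         if item in test_likes[u] and item not in rec[u] and len(rec[u]) < k), None)
            if user is not None:
                place(item, user)
                placed_hits += 1
                break
        else:
            break  # no more hits can be placed

    # Step 2: top up the lists with non-hit items, least-recommended first
    for user in rec:
        while len(rec[user]) < k:
            item = min((i for i in items if i not in rec[user] and i not in test_likes[user]),
                       key=lambda i: count[i], default=None)
            if item is None:  # nothing non-hit left; fall back to any item not yet in the list
                item = min((i for i in items if i not in rec[user]), key=lambda i: count[i])
            place(item, user)
    return rec

# Toy setting from the paper: 8 users, 5 items, k = 3, precision 0.5 -> 12 hits
likes = {"u1": {"i1","i2","i5"}, "u2": {"i5"}, "u3": {"i1","i2","i3","i5"}, "u4": {"i2","i3"},
         "u5": {"i1","i3"}, "u6": {"i5"}, "u7": {"i2","i3"}, "u8": {"i2","i5"}}
lists = maximize_ec_lists(likes, ["i1", "i2", "i3", "i4", "i5"], k=3, hits_needed=12)
```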

2.2.2 Maximizing novelty

In maximizing novelty at a fixed precision, the basics are the same as in the previous method. There are two steps: in the first we fill in the hits, and in the second we complete the recommendation lists. However, the filling starts with the items that have the highest novelty in the system. That is, we sort the items according to their novelty and start filling in the specified number of hits using the most novel items. Since the frequency with which an item is recommended does not affect the novelty measure, we recommend the most novel item to all users who liked it in the test set, then the second most novel item to all users who liked it, and so on until the required number of hits is reached. Then, for the rest of the recommendations, we start again from the most novel item and recommend it to the users to whom it has not been suggested yet, and this procedure continues down the sorted list of items until the recommendation matrix is complete.
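A rough Python sketch of this novelty-maximization step is given below. It assumes item novelty has already been computed from eq. (5) (equivalently, items are sorted by ascending degree), and it skips (user, item) pairs that would create extra hits in the second step, which is implied by the fixed-precision constraint though not stated explicitly; helper names are illustrative.

```python
def maximize_novelty_lists(test_likes, items_by_novelty, k, hits_needed):
    """Greedy top-k lists with a fixed number of hits, favouring the most novel items.

    test_likes: dict user -> set of items liked in the test set
    items_by_novelty: item ids sorted from most to least novel (ascending degree)
    """
    rec = {u: [] for u in test_likes}

    # Step 1: fill the hits, most novel items first
    placed = 0
    for item in items_by_novelty:
        for user in test_likes:
            if placed == hits_needed:
                break
            if item in test_likes[user] and item not in rec[user] and len(rec[user]) < k:
                rec[user].append(item)
                placed += 1
        if placed == hits_needed:
            break

    # Step 2: complete the lists, again walking down the novelty-sorted items,
    # skipping pairs that would add extra hits (keeps the precision fixed)
    for item in items_by_novelty:
        for user in rec:
            if len(rec[user]) < k and item not in rec[user] and item not in test_likes[user]:
                rec[user].append(item)
    return rec
```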


2.2.3 Combining evaluation metrics

By computing the maximum possible values of precision, novelty and entropy-based coverage, we have more sources of information for evaluating different algorithms than when we only have the absolute values of these metrics. It is worth noting that the main contribution and focus of this paper is to provide this additional information in order to better evaluate recommenders. Although there are different ways to utilize this new information, we first normalize the absolute values using their respective maximum values and then combine them into one integrated metric, in order to suggest a simple possible solution for the overall comparison of recommender algorithms.


More specifically, in this step we divide the values of these metrics obtained for our target RS by their corresponding maximum values, in order to measure how well the RS has performed given the input data,

\[ \bar{P}_k = \frac{P_k}{P_{k\text{-}max}}, \tag{12} \]

\[ \bar{EC} = \frac{EC}{EC_{max}}, \tag{13} \]

\[ \bar{C} = \frac{C}{C_{max}}, \tag{14} \]

\[ \bar{N} = \frac{N}{N_{max}}, \tag{15} \]

where Pk-max is the maximum possible precision according to the test set (using the maximum number of hits). Similarly, ECmax, Cmax and Nmax are the maximum possible entropy-based coverage, coverage and novelty in the data. The last step is to combine the results; there are various ways to do so. We use a simple approach, namely their harmonic mean, as in the case of precision and recall in the F1 score. For example, with a fixed precision, we can have a unified metric, UM, which relates novelty and coverage to each other, as

\[ UM(EC, N) = \frac{2 \cdot EC \cdot N}{EC + N}. \tag{16} \]

The advantage here is that we can change the weight of a metric; e.g., if novelty is far more important than entropy-based coverage in a system, we can change the weights to emphasize novelty by using β as in eq. (17),

\[ UM_{\beta}(EC, N) = (1 + \beta^{2})\,\frac{EC \cdot N}{\beta^{2} \cdot EC + N}, \tag{17} \]

which assigns β times as much weight to the second metric as to the first one. Note that for β = 1 it is equal to eq. (16). To extend the above concept so as to consider the normalized precision, novelty and entropy-based coverage all together, we can use the generalized form of the harmonic mean for x metrics, eq. (18), where m_i denotes an evaluation metric and v_i its value,

\[ UM(m_1, \ldots, m_x) = x \left( \sum_{i=1}^{x} \frac{1}{v_i} \right)^{-1}. \tag{18} \]

Therefore, for the three metrics of precision, novelty and entropy-based coverage we use eq. (19),

\[ UM(P_k, EC, N) = 3 \cdot \frac{1}{\dfrac{1}{P_k} + \dfrac{1}{N} + \dfrac{1}{EC}}. \tag{19} \]
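Putting eqs. (12)-(19) together, a minimal Python sketch of the combination step might look like the following; the maximum values are assumed to be supplied by the maximization procedures of sections 2.2.1 and 2.2.2, the numeric maxima in the example are placeholders rather than values from the paper, and the function names are ours.

```python
def normalize(value, max_value):
    """Eqs. (12)-(15): divide a metric by its maximum attainable value."""
    return value / max_value

def unified_metric(*normalized_values):
    """Eq. (18)/(19): generalized harmonic mean of x normalized metrics."""
    x = len(normalized_values)
    return x / sum(1.0 / v for v in normalized_values)

def weighted_um(first, second, beta=1.0):
    """Eq. (17): weighted combination of two metrics;
    beta < 1 emphasizes the first argument, beta > 1 the second."""
    return (1 + beta**2) * first * second / (beta**2 * first + second)

# Example with illustrative numbers (the maxima here are assumed placeholders)
p = normalize(0.10, 1.0)
n = normalize(6.24, 8.4)
ec = normalize(8.39, 10.7)
print(unified_metric(p, n, ec))   # eq. (19)
print(weighted_um(ec, n))         # eq. (16), i.e. beta = 1
```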

3. EXPERIMENTAL RESULTS
We applied the proposed framework to a set of different recommenders and the benchmark MovieLens dataset. Further details are provided in the following subsections.

3.1.1 Algorithms
We compared several recommenders: a random recommender, Item-based Collaborative Filtering (ICF), Regularized Singular Value Decomposition (RSVD), and Probabilistic Matrix Factorization (PMF). The random recommender suggests random items to users, as the name implies. ICF computes the similarity between items based on the ratings given to them by users, and then predicts the rating of an item unseen by a user based on the calculated similarities [17]. RSVD and PMF decompose the rating matrix into user and item feature matrices to find a model that fits the input data, and predict based on the extracted patterns [18, 19]. Note that ICF is a memory-based CF method, since it uses the whole available data and loads it into memory, while RSVD and PMF are model-based CF approaches. All the recommenders produce a recommendation matrix of size m×k, recommending the top k items to each of the m users, which is given as input to the evaluation phase. It is worth mentioning that since our focus is on the evaluation of RSs, the recommender algorithms themselves are not important per se; one can use this framework for the evaluation of other RSs, such as content-based or hybrid recommenders.

3.1.2 Dataset
The benchmark dataset used here is MovieLens-100k, which is freely available for download at http://grouplens.org/datasets/movielens/. To get more accurate results, users with fewer than 10 likes in their test set and no ratings in the train set were omitted, and 10 items were recommended to each user. In order to evaluate the recommenders, the data was divided into training and test sets with a 0.8 to 0.2 ratio, respectively. Moreover, the recommendation matrix was converted to a binary form in which one means the user will probably like the item and zero means she will probably dislike it. The data in the test set was converted as well; the threshold here is the average rating of each user, i.e., if a generous user always gives high ratings, her average rating will be higher than that of a critical user who consistently gives low ratings.
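As a small illustration of the test-set conversion described above, the following sketch binarizes each user's ratings with that user's own average rating as the threshold; the column names and the thresholding detail (ratings strictly above the mean count as likes) are our assumptions.

```python
import pandas as pd

def binarize_by_user_mean(ratings: pd.DataFrame) -> pd.DataFrame:
    """Turn explicit ratings into like (1) / dislike (0) per user.

    ratings: DataFrame with columns ['user', 'item', 'rating'].
    """
    user_mean = ratings.groupby("user")["rating"].transform("mean")
    out = ratings.copy()
    out["like"] = (ratings["rating"] > user_mean).astype(int)  # assumed: strictly above the user's mean
    return out

# Example
df = pd.DataFrame({"user": [1, 1, 1, 2, 2],
                   "item": [10, 11, 12, 10, 13],
                   "rating": [5, 3, 4, 2, 4]})
print(binarize_by_user_mean(df))
```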

3.1.3 Results
Figure 1 shows the maximum possible values of entropy-based coverage and novelty for precision varying in the range [0, 1] on this dataset. As Figure 1 shows, and as discussed before, as precision increases, the maximum possible values of entropy-based coverage and novelty of a recommender decrease, due to the increasing number of hits that must be placed in the recommendation matrix. Clearly, as precision grows, our freedom in distributing items decreases, resulting in reduced maximum values; this happens because we are restricted to using the hits in the test set. Therefore, the reduction of the maximum values depends on the structure of the hits in the data. Considering the relationship between the maximum entropy-based coverage and novelty and the precision depicted in Figure 1, if a recommender has a precision of 0.8, the maximum possible novelty it can reach is about 4. Therefore, we should not undervalue its novelty of 3 (as an example) when comparing it with another recommender which has a precision of 0.6 and a novelty of 4.5. The same pattern holds for the maximum value of entropy-based coverage, though with a smaller slope.


Figure 1. Maximum values of entropy-based coverage and novelty for different values of precision.

We implemented four different RSs, divided their novelty and entropy-based coverage by the maximum values respectively, then combined them using equations (16), (17) and (19). Table 6 shows the performance of these recommenders based on three measures of precision, novelty and entropy-based coverage.

Table 6. Evaluation results of the implemented RSs.

Metrics/RSs            | Random | ICF  | RegSVD | PMF
Precision              | 0.01   | 0.10 | 0.15   | 0.18
Novelty                | 5.02   | 6.24 | 3.93   | 2.64
Entropy-based coverage | 10.39  | 8.39 | 4.82   | 8.32

Table 7 shows the normalized values of precision, novelty and entropy-based coverage for these four RSs, obtained using their maximum bounds. All the values are in the range [0, 1], and one can either compare these measures directly or combine them using equations (16) to (19) and then compare the results.

Table 7. Normalized values of the computed evaluation metrics for the implemented RSs.

Metrics/RSs            | Random | ICF  | RegSVD | PMF
Precision              | 0.01   | 0.10 | 0.15   | 0.18
Novelty                | 0.56   | 0.74 | 0.49   | 0.34
Entropy-based coverage | 0.97   | 0.78 | 0.45   | 0.77

Considering the ICF and RegSVD recommenders, the first has better novelty and entropy-based coverage, but lower precision, than the latter. If we use eq. (16) to compare how well these two have performed with regard to the novelty and entropy-based coverage of their recommendations (at their respective precisions), we obtain

\[ UM(EC, N)_{ICF} = 0.76, \qquad UM(EC, N)_{RegSVD} = 0.47, \tag{20} \]

where ICF outperforms RegSVD. In fact, using this method and neglecting the precision of the recommenders, we can see that the random recommender and ICF perform better than the two model-based approaches, as shown in Figure 2.

Figure 2. Comparison of different recommenders using the proposed evaluation framework (considering only novelty and entropy-based coverage).

However, when we add precision using eq. (19), we have

\[ UM(P_k, EC, N)_{ICF} = 0.23, \qquad UM(P_k, EC, N)_{RegSVD} = 0.28, \tag{21} \]

which shows that RegSVD obtains better results than ICF. Thus, with this configuration (these settings of the algorithm parameters), one should use ICF where the novelty and entropy-based coverage of the results are much more important than their accuracy, and use RegSVD when all three of these qualities are valued. Figure 3 presents the results of comparing all four RSs using the proposed framework based on the precision, novelty and entropy-based coverage metrics.

Figure 3. Comparison of different recommenders using the proposed evaluation framework (considering precision, novelty and entropy-based coverage).

Eq. (17) is useful when, of two metrics, we value one more than the other. For example, if precision is not very important or the precisions of two recommenders are close to each other, and we want to decide based on novelty and entropy-based coverage while emphasizing novelty, we can put more weight on it. Using eq. (17) and setting β = 0.5 to put more weight on novelty than on entropy-based coverage, we obtain

\[ UM_{0.5}(N, EC)_{PMF} = 0.38, \qquad UM_{0.5}(N, EC)_{RegSVD} = 0.48. \tag{22} \]

The result shows that, although the entropy-based coverage of PMF is higher, given their close precisions and the higher novelty of RegSVD, RegSVD outperforms PMF under this policy.

As Figure 3 indicates, our implementations of the model-based techniques outperformed the memory-based approaches on the employed dataset when all three metrics are considered. The random recommender has a low precision, as expected; however, its high novelty and entropy-based coverage make this approach applicable in systems where recommending novel and diverse items without heavy computational cost is acceptable (refer to Figure 2). Note that we do not claim to have set the best parameters for these algorithms, and we do not intend to declare a winner here; the main purpose is to present a novel framework for the evaluation of recommender systems. One can add other metrics to this framework, change the combination formula, or weight the metrics differently. It is logical that in order to evaluate a RS thoroughly, we need to consider as many important factors as possible in our examinations. It is also clear that trying to choose a single best evaluation metric is futile, since the competence of a RS is not always dependent on a particular aspect, and is highly related to the context and application of the recommendations as well.
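The numbers in eqs. (20)-(22) can largely be reproduced from the normalized values in Table 7 with a few lines of Python; this is a small consistency check, not part of the original paper, and the remaining small differences come from rounding in Table 7.

```python
def harmonic_um(*values):
    """Eq. (18)/(19): generalized harmonic mean."""
    return len(values) / sum(1.0 / v for v in values)

def weighted_um(first, second, beta):
    """Eq. (17): (1 + beta^2) * first * second / (beta^2 * first + second)."""
    return (1 + beta**2) * first * second / (beta**2 * first + second)

# Normalized (precision, novelty, EC) triples from Table 7
icf, regsvd, pmf = (0.10, 0.74, 0.78), (0.15, 0.49, 0.45), (0.18, 0.34, 0.77)

print(harmonic_um(icf[2], icf[1]))             # eq. (20): ~0.76 for ICF
print(harmonic_um(regsvd[2], regsvd[1]))       # ~0.47 for RegSVD
print(harmonic_um(*icf), harmonic_um(*regsvd)) # eq. (21): ~0.24 and ~0.27 (paper: 0.23, 0.28)
print(weighted_um(pmf[1], pmf[2], 0.5))        # eq. (22): ~0.38 for PMF
print(weighted_um(regsvd[1], regsvd[2], 0.5))  # ~0.48 for RegSVD
```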

4. CONCLUSION AND FUTURE WORK
Nowadays, recommender systems are widely used to help users find what they will probably like. These systems have mostly been evaluated using accuracy-based metrics; with the emergence of new evaluation metrics such as diversity, coverage and novelty, and the significant inconsistency between their values, the need for a standard framework to measure the performance of RSs has grown rapidly. In this paper, we proposed a novel framework to unify the evaluation of recommenders' quality considering more than one metric, applied the framework, and presented the results of our implementations. In future work, we aim to expand our framework to include other evaluation metrics and to provide a comprehensive approach for the evaluation of recommender algorithms.

5. REFERENCES
1. Resnick, P. and Varian, H.R. Recommender systems. Communications of the ACM, 40(3): 56-58, 1997.
2. Bobadilla, J., et al. Recommender systems survey. Knowledge-Based Systems, 2013.
3. Herlocker, J.L., et al. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1): 5-53, 2004.
4. Gunawardana, A. and Shani, G. A survey of accuracy evaluation metrics of recommendation tasks. The Journal of Machine Learning Research, 10: 2935-2962, 2009.
5. Breese, J.S., Heckerman, D. and Kadie, C. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1998.
6. Sarwar, B., et al. Application of dimensionality reduction in recommender system - a case study. DTIC Document, 2000.
7. Lü, L., et al. Recommender systems. Physics Reports, 519(1): 1-49, 2012.
8. Zhou, T., et al. Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences, 107(10): 4511-4515, 2010.
9. McNee, S.M., Riedl, J. and Konstan, J.A. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI'06 Extended Abstracts on Human Factors in Computing Systems, ACM, 2006.
10. Zhou, T., et al. Accurate and diverse recommendations via eliminating redundant correlations. New Journal of Physics, 11(12): 123008, 2009.
11. Liu, J.-G., Shi, K. and Guo, Q. Solving the accuracy-diversity dilemma via directed random walks. Physical Review E, 85(1): 016118, 2012.
12. Gan, M. and Jiang, R. Improving accuracy and diversity of personalized recommendation through power law adjustments of user similarities. Decision Support Systems, 55(3): 811-821, 2013.
13. Javari, A. and Jalili, M. A probabilistic model to resolve diversity-accuracy challenge of recommendation systems. Knowledge and Information Systems, 1-19, 2014.
14. Cleverdon, C. and Kean, M. Factors Determining the Performance of Indexing Systems. 1966. URL: http://dspace.lib.cranfield.ac.uk/handle/1826/863.
15. Zhou, T., et al. Effect of initial configuration on network-based recommendation. EPL (Europhysics Letters), 81(5): 58004, 2008.
16. Shani, G. and Gunawardana, A. Evaluating recommendation systems. In Recommender Systems Handbook, Springer, 257-297, 2011.
17. Sarwar, B., et al. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, ACM, 2001.
18. Mnih, A. and Salakhutdinov, R. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2007.
19. Koren, Y., Bell, R. and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8): 30-37, 2009.