RISS Academic Research Information Service

      • Compound Method Based on Frequent Terms for Near Duplicate Documents Detection

        Gaudence Uwamahoro, Zhang Zuping, Ambele Robert Mtafya, Jun Long. Security Engineering Research Support Center (SERSC), 2014. International Journal of Database Theory and Application, Vol.7 No.6.

        Examining data to find similar items is a major problem in data mining and information retrieval. Documents are abundant, and most are duplicates or near duplicates that inflate storage space and the time needed to search for information. Dimensionality reduction and good organization of the data are two ways to address this efficiency problem. In this paper we propose a method based on frequent terms mined from each document to reduce the data size, together with an efficient method for clustering documents that are closely similar. Using our method, only 36.4% of the original data size is retained. Similarity between documents is based on the frequent terms they share. Our method runs in O(n) time, whereas current clustering methods require O(n³).
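
        As a rough illustration of this approach (a sketch, not the authors' exact algorithm), the Python snippet below represents each document by its frequent terms only and links documents that share enough of them; the thresholds, the toy corpus, and the single-pass clustering loop are illustrative assumptions.

            from collections import Counter

            def frequent_terms(text, min_count=2):
                """Keep only terms occurring at least min_count times in the document."""
                counts = Counter(text.lower().split())
                return {t for t, c in counts.items() if c >= min_count}

            def shared_term_similarity(a, b):
                """Jaccard similarity over two frequent-term sets."""
                if not a or not b:
                    return 0.0
                return len(a & b) / len(a | b)

            docs = [
                "data mining mining of text data and text retrieval",
                "text data mining mining and retrieval of text data",
                "a completely different completely different topic",
            ]
            reps = [frequent_terms(d) for d in docs]  # reduced representations

            # One linear pass: compare each document to cluster representatives only,
            # rather than doing full pairwise comparisons over whole documents.
            clusters = []  # list of (representative_terms, [doc indices])
            for i, rep in enumerate(reps):
                for rep_terms, members in clusters:
                    if shared_term_similarity(rep, rep_terms) >= 0.5:
                        members.append(i)
                        break
                else:
                    clusters.append((rep, [i]))

            print([members for _, members in clusters])  # [[0, 1], [2]]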

      • Efficient Pairwise Document Similarity Computation in Big Datasets

        Papias Niyigena, Zhang Zuping, Weiqi Li, Jun Long. Security Engineering Research Support Center (SERSC), 2015. International Journal of Database Theory and Application, Vol.8 No.4.

        Document similarity is a common task in a variety of problems such as clustering, unsupervised learning, and text retrieval. A document whose content is very similar to another's provides little or no new information to the user. This work tackles that problem, focusing on detecting near-duplicate documents in large corpora. In this paper, we present a new method to compute pairwise document similarity in a corpus that reduces execution time and saves space. Our method groups the shingles of all documents in the corpus into a relation, which can efficiently manage up to millions of records and eases counting and aggregation. Three algorithms are introduced to reduce the candidate shingles to be compared: the first creates the relation of shingles to be considered, the second creates the set of triples, and the third computes document similarity by efficiently counting the shingles shared between documents. The experimental results show that our method reduces the number of candidate pairs to be compared, which in turn reduces execution time and space compared with existing algorithms that compute all candidate pairs.
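
        As a sketch of this shingle-grouping idea (the paper's three algorithms and its relation/triple layout are more elaborate), the snippet below builds a shingle-to-documents relation, drops shingles that occur in only one document, and aggregates shared-shingle counts per candidate pair; the shingle size and toy corpus are illustrative.

            from collections import defaultdict
            from itertools import combinations

            def shingles(text, k=3):
                """All contiguous k-word shingles of a document."""
                words = text.lower().split()
                return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

            docs = {
                "d1": "the quick brown fox jumps over the lazy dog",
                "d2": "the quick brown fox leaps over the lazy dog",
                "d3": "an entirely unrelated piece of text goes here",
            }
            doc_shingles = {d: shingles(t) for d, t in docs.items()}

            # Relation: shingle -> documents containing it.
            relation = defaultdict(set)
            for d, sh in doc_shingles.items():
                for s in sh:
                    relation[s].add(d)

            # Shingles occurring in only one document cannot be shared; dropping
            # them shrinks the set of candidates to compare.
            relation = {s: ds for s, ds in relation.items() if len(ds) > 1}

            # Count shared shingles per candidate pair by aggregating over the
            # relation, instead of comparing every document against every other.
            shared = defaultdict(int)
            for s, ds in relation.items():
                for a, b in combinations(sorted(ds), 2):
                    shared[(a, b)] += 1

            for (a, b), n in shared.items():
                union = len(doc_shingles[a] | doc_shingles[b])
                print(a, b, round(n / union, 3))  # Jaccard similarity of the pair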

      • Near Duplicate Document Detection using Document Image

        Gaudence Uwamahoro, Zhang Zuping, Ambele Robert Mtafya, Weiqi Li, Long Jun. Security Engineering Research Support Center (SERSC), 2016. International Journal of Multimedia and Ubiquitous Engineering, Vol.11 No.7.

        The growth of Internet access has allowed huge collections of documents to be stored. Identifying near-duplicate documents among them is a major problem in information retrieval, because their high dimensionality leads to a high time cost. We propose an algorithm based on the tf-idf method, using the importance and discriminative power of a term within a single document, to speed up the search process for detecting how similar the documents in a collection are. Using only 26.6% of the original document size, our method performs well on efficiency and memory usage compared to working with the original documents, which leads to a shorter search time for similar documents in a collection.
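
        A minimal sketch, assuming the core idea is to keep only each document's highest-weighted tf-idf terms (its most important, discriminative ones) and to compare those reduced sets; the fraction of terms kept below is illustrative and unrelated to the paper's 26.6% figure.

            import math
            from collections import Counter

            docs = [
                "apples and oranges are fruit apples are sweet",
                "apples and oranges are fruit oranges are sour",
                "cars and trucks are vehicles cars are fast",
            ]
            tokenized = [d.split() for d in docs]
            df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
            n_docs = len(docs)

            def top_tfidf_terms(tokens, keep_fraction=0.5):
                """Reduce a document to its top tf-idf terms."""
                tf = Counter(tokens)
                scores = {t: (c / len(tokens)) * math.log(n_docs / df[t])
                          for t, c in tf.items()}
                k = max(1, int(len(scores) * keep_fraction))
                return set(sorted(scores, key=scores.get, reverse=True)[:k])

            # Compare the reduced term sets; the first two documents share terms,
            # while the third scores zero against both.
            reduced = [top_tfidf_terms(toks) for toks in tokenized]
            for i in range(n_docs):
                for j in range(i + 1, n_docs):
                    jac = len(reduced[i] & reduced[j]) / len(reduced[i] | reduced[j])
                    print(i, j, round(jac, 3))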

      • Efficient Document Similarity Detection Using Weighted Phrase Indexing

        Papias Niyigena, Zhang Zuping, Mansoor Ahmed Khuhro, Damien Hanyurwimfura. Security Engineering Research Support Center (SERSC), 2016. International Journal of Multimedia and Ubiquitous Engineering, Vol.11 No.5.

        Document similarity techniques mostly rely on single-term analysis of the documents in the data set. To improve the efficiency and effectiveness of document similarity detection, many researchers have developed more informative feature terms. In this paper, we present a weighted phrase index, which indexes the documents in the data set based on important phrases. Phrasal indexing reduces the ambiguity inherent in words considered in isolation and thus improves the effectiveness of document similarity computation. The method presented here inherits the tf-idf term weighting scheme for computing important phrases in the collection: it computes the weight of each phrase in the document collection, and the phrases whose weight exceeds a given threshold are identified as important and indexed. The data dimensionality that hinders the performance of other document similarity methods is addressed by offline creation of an index of important phrases for every document. The evaluation experiments indicate that the presented method is very effective at document similarity detection, and its quality surpasses both the traditional phrase-based approach, which ignores dimensionality reduction, and other methods that use single-word tf-idf.
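
        The snippet below sketches phrase-weight indexing under simplifying assumptions: phrases are taken to be word bigrams, importance is a plain tf-idf-style weight, and the threshold is arbitrary; the paper's phrase extraction and weighting are more elaborate.

            import math
            from collections import Counter, defaultdict

            docs = [
                "machine learning improves information retrieval systems",
                "information retrieval systems rely on machine learning",
                "the weather today is cold and windy",
            ]

            def bigrams(text):
                w = text.lower().split()
                return [" ".join(w[i:i + 2]) for i in range(len(w) - 1)]

            doc_phrases = [bigrams(d) for d in docs]
            df = Counter(p for ph in doc_phrases for p in set(ph))
            n = len(docs)

            # Offline step: index only phrases whose tf-idf weight clears the
            # (illustrative) threshold.
            index = defaultdict(set)  # phrase -> documents indexed under it
            THRESHOLD = 0.05
            for doc_id, phrases in enumerate(doc_phrases):
                tf = Counter(phrases)
                for p, c in tf.items():
                    weight = (c / len(phrases)) * math.log(n / df[p])
                    if weight >= THRESHOLD:
                        index[p].add(doc_id)

            # Similarity via the index: count important phrases two documents share.
            shared = defaultdict(int)
            for p, ids in index.items():
                ids = sorted(ids)
                for i in range(len(ids)):
                    for j in range(i + 1, len(ids)):
                        shared[(ids[i], ids[j])] += 1
            print(dict(shared))  # {(0, 1): 3}: the related pair shares 3 phrases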

      • A Lexicon-based Approach for Hate Speech Detection

        Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, Jun Long. Security Engineering Research Support Center (SERSC), 2015. International Journal of Multimedia and Ubiquitous Engineering, Vol.10 No.4.

        We explore the idea of creating a classifier that can be used to detect the presence of hate speech in web discourses such as web forums and blogs. In this work, the hate speech problem is abstracted into three main thematic areas: race, nationality, and religion. The goal of our research is to create a model classifier that uses sentiment analysis techniques, and in particular subjectivity detection, not only to detect that a given sentence is subjective but also to identify and rate the polarity of its sentiment expressions. We begin by whittling down the document size by removing objective sentences. Then, using subjectivity and semantic features related to hate speech, we create a lexicon that is employed to build a classifier for hate speech detection. Experiments with a hate corpus show significant practical application for real-world web discourse.
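
        A crude sketch of the described pipeline, with a placeholder subjectivity cue list and a tiny hypothetical lexicon standing in for the paper's subjectivity detector and hate-speech lexicon.

            SUBJECTIVITY_CUES = {"i", "we", "they", "think", "feel", "hate", "love", "should"}
            HATE_LEXICON = {"hate": 0.9, "despise": 0.8, "inferior": 0.7, "vermin": 0.9}

            def is_subjective(sentence):
                """Crude subjectivity test: does the sentence contain an opinion cue?"""
                return bool(set(sentence.lower().split()) & SUBJECTIVITY_CUES)

            def hate_score(sentence):
                """Average lexicon polarity over the sentence's words (0 if none match)."""
                words = sentence.lower().split()
                hits = [HATE_LEXICON[w] for w in words if w in HATE_LEXICON]
                return sum(hits) / len(hits) if hits else 0.0

            def classify(document, threshold=0.5):
                sentences = [s.strip() for s in document.split(".") if s.strip()]
                subjective = [s for s in sentences if is_subjective(s)]  # whittle down
                scores = [hate_score(s) for s in subjective]
                return max(scores, default=0.0) >= threshold

            print(classify("The meeting is at noon. I think they are vermin."))   # True
            print(classify("The meeting is at noon. We love this neighborhood.")) # False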

      • Detecting Polarizing Language in Twitter using Topic Models and ML Algorithms

        Njagi Dennis Gitari, Zhang Zuping, Wandabwa Herman. Security Engineering Research Support Center (SERSC), 2016. International Journal of Hybrid Information Technology, Vol.9 No.9.

        The upsurge in the use of social media in public discourse has made it possible for social scientists to engage in emerging and interesting areas of research. Public debates tend to assume polar positions along political, social, or ideological lines, and the polarity in the language used typically consists of blaming the opposing group. In this paper, we investigate the detection of polarizing language in tweets in the event of a disaster. Our approach combines topic modeling and machine learning (ML) algorithms to generate topics that we consider polarized, thereby classifying a given tweet as polar or not. Our latent Dirichlet allocation (LDA)-based model incorporates external resources, in the form of a lexicon of blame-oriented words, to induce the generation of polar topics. Collapsed Gibbs sampling is used to infer new documents and to estimate the values of the parameters employed in our model. We computed log-likelihood (LL) ratios using our model and two other state-of-the-art LDA-based models for evaluation. Furthermore, we compared polarized-language classification accuracy using features extracted from polarized topics, bag-of-words (BOW) features, and part-of-speech (POS)-based features. Preliminary experiments returned a higher overall accuracy of 87.67% using topic-based features compared to BOW and POS-based features.
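
        A hedged sketch of the topic-features-plus-classifier pipeline using scikit-learn; note that scikit-learn's LDA uses variational inference rather than the paper's collapsed Gibbs sampling, no blame lexicon is wired into the topic model here, and the corpus and labels are purely illustrative.

            from sklearn.feature_extraction.text import CountVectorizer
            from sklearn.decomposition import LatentDirichletAllocation
            from sklearn.linear_model import LogisticRegression

            tweets = [
                "they are to blame for this disaster response failure",
                "the government caused this mess and should be ashamed",
                "volunteers are distributing food and water downtown",
                "shelters are open and roads are being cleared today",
            ]
            labels = [1, 1, 0, 0]  # 1 = polarizing, 0 = not (illustrative labels)

            # Topic model over the tweets; topic proportions become the features.
            vec = CountVectorizer(stop_words="english")
            X_counts = vec.fit_transform(tweets)
            lda = LatentDirichletAllocation(n_components=2, random_state=0)
            X_topics = lda.fit_transform(X_counts)  # shape: (n_tweets, n_topics)

            # Classifier on topic-based features (the paper also compares BOW and POS).
            clf = LogisticRegression().fit(X_topics, labels)
            new = vec.transform(["officials should be blamed for the slow response"])
            print(clf.predict(lda.transform(new)))  # predicted label for an unseen tweet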

      • KCI-listed

        Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings

        Kamal Al-sabahi, Zhang Zuping, Yang Kang. Korean Society for Internet Information (KSII), 2019. KSII Transactions on Internet and Information Systems, Vol.13 No.1.

        Since the amount of information on the internet is growing rapidly, it is not easy for a user to find information relevant to his or her query. To tackle this issue, researchers are paying much attention to document summarization. The key to any successful document summarizer is a good document representation. Traditional approaches based on word overlap mostly fail to produce that kind of representation, while word embeddings have shown good performance by allowing words to match on a semantic level. However, naively concatenating word embeddings makes common words dominant, which in turn diminishes the quality of the representation. In this paper, we employ word embeddings to improve the weighting schemes used to build the input matrix of Latent Semantic Analysis. Two embedding-based weighting schemes are proposed and then combined to compute the values of this matrix; they are modified versions of the augment weight and the entropy frequency that combine the strengths of traditional weighting schemes and word embeddings. The proposed approach is evaluated on three English datasets: DUC 2002, DUC 2004, and Multilingual 2015 Single-document Summarization. Experimental results on the three datasets show that the proposed model achieves competitive performance compared to the state of the art, leading to the conclusion that it provides a better document representation and, as a result, a better document summary.
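
        A minimal sketch of LSA-based extractive summarization; the paper's actual contribution, embedding-based versions of the augment weight and entropy frequency, is stood in for by a plain term-frequency weight so the sketch stays self-contained.

            import numpy as np
            from collections import Counter

            sentences = [
                "word embeddings capture semantic similarity between words",
                "latent semantic analysis factorizes a term sentence matrix",
                "the weather was pleasant during the whole trip",
                "embedding based weights improve the lsa input matrix",
            ]
            tokenized = [s.lower().split() for s in sentences]
            vocab = sorted({w for toks in tokenized for w in toks})
            w_index = {w: i for i, w in enumerate(vocab)}

            # Term-by-sentence input matrix A; A[i, j] is the weight of term i in
            # sentence j (the slot where the embedding-based schemes would plug in).
            A = np.zeros((len(vocab), len(sentences)))
            for j, toks in enumerate(tokenized):
                for w, c in Counter(toks).items():
                    A[w_index[w], j] = c

            # SVD: right singular vectors give each sentence's strength per latent topic.
            U, s, Vt = np.linalg.svd(A, full_matrices=False)

            # Pick the top sentence for each of the leading latent topics.
            summary_size = 2
            chosen = []
            for topic in range(summary_size):
                ranked = np.argsort(-np.abs(Vt[topic]))
                for j in ranked:
                    if j not in chosen:
                        chosen.append(int(j))
                        break
            print([sentences[j] for j in sorted(chosen)])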
