Package - kbenoit/quanteda

quanteda: quantitative analysis of textual data

About

An R package for managing and analyzing text, created by Kenneth Benoit in collaboration with a team of core contributors: Kohei Watanabe, Paul Nulty, Adam Obeng, Haiyan Wang, Ben Lauderdale, and Will Lowe.
Supported by the European Research Council grant ERC-2011-StG 283794-QUANTESS.

For more details, see http://quanteda.io.

How to cite the package:


Benoit K (2017). _quanteda: Quantitative Analysis of Textual
Data_. doi: 10.5281/zenodo.1004683 (URL:
http://doi.org/10.5281/zenodo.1004683), R package version 0.99.22,
<URL: http://quanteda.io>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {quanteda: Quantitative Analysis of Textual Data},
    author = {Kenneth Benoit},
    year = {2017},
    doi = {10.5281/zenodo.1004683},
    url = {http://quanteda.io},
    note = {R package version 0.99.22},
  }

How to Install

  1. From CRAN: Use your GUI’s R package installer, or execute:

    install.packages("quanteda") 
    
  2. From GitHub, using:

    # devtools package required to install quanteda from GitHub 
    devtools::install_github("kbenoit/quanteda") 
    

    Because this compiles some C++ source code, you will need a compiler installed. If you are using a Windows platform, this means you will also need to install the Rtools software available from CRAN. If you are using macOS, you will need to install Xcode, available for free from the App Store, or if you prefer a lighter footprint set of tools, just the Xcode command line tools, using the command xcode-select --install from the Terminal.

    Also, you might need to upgrade your compiler. @kbenoit found that his macOS build only worked reliably after upgrading the default Xcode compiler to clang4, following these instructions.

  3. Additional recommended packages:

    The following packages work well with or extend quanteda and we recommend that you also install them:

    • readtext: An easy way to read text data into R, from almost any input format.

    • spacyr: NLP using the spaCy library, including part-of-speech tagging, entity recognition, and dependency parsing.

    • quantedaData: Additional textual data for use with quanteda.

      devtools::install_github("kbenoit/quantedaData")
      
    • LIWCalike: An R implementation of the Linguistic Inquiry and Word Count approach to text analysis.

      devtools::install_github("kbenoit/LIWCalike")
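
After installing, it can help to confirm which of the packages named above are actually available. The following is a minimal base R sketch, not part of quanteda itself; the variable names `pkgs`, `installed`, and `missing_pkgs` are illustrative only.

```r
# Packages named in the installation notes above.
pkgs <- c("quanteda", "readtext", "spacyr", "quantedaData", "LIWCalike")

# requireNamespace() returns TRUE when a package's namespace can be loaded;
# it loads the namespace but does not attach the package.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need installing (possibly none).
missing_pkgs <- names(installed)[!installed]
print(missing_pkgs)
```

Any name printed here still needs to be installed using the matching command from the list above.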
      

Leaving feedback

If you like quanteda, please consider leaving feedback or a testimonial here.

Contributing

Contributions in the form of feedback, comments, code, and bug reports are most welcome. How to contribute:

Releases

v0.99.22 - Nov 13, 2017

New Features

  • tokens_select() has a new window argument, permitting selection within an asymmetric window around the pattern of selection. (#521)
  • tokens_replace() now allows token types to be substituted directly and quickly.
  • Added a spacy_parse method for corpus objects. Also restored quanteda methods for spacyr spacy_parsed objects.
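
As an illustration of the asymmetric window described above, here is a short sketch. It assumes that a two-element `window` gives the number of tokens to keep before and after each match, in that order; the toy text is invented for the example.

```r
library(quanteda)

toks <- tokens("a b c d e f g")

# Keep "d" together with one token before it and two tokens after it.
sel <- tokens_select(toks, "d", window = c(1, 2))
print(sel)
```

Under that assumption, the selection should contain the tokens c, d, e, and f.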

Bug fixes and stability enhancements

  • Improved documentation for textmodel_nb() (#1010), and made output quantities from the fitted NB model regular matrix objects instead of Matrix classes.

Behaviour Changes

  • All of the deprecated functions are now removed. (#991)
  • tokens_group() is now significantly faster.
  • The deprecated "list of characters" tokenize() function and all methods associated with the tokenizedTexts object types have been removed.
  • Added convenience functions for keeping tokens or features: tokens_keep(), dfm_keep(), and fcm_keep(). (#1037)
  • textmodel_NB() has been replaced by textmodel_nb().
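
The convenience functions for keeping tokens or features can be sketched as follows. This is a toy example with invented text, and it assumes that the `*_keep()` functions accept the same pattern argument as their `*_select()` counterparts.

```r
library(quanteda)

toks <- tokens("the quick brown fox")

# Keep only the listed token types and drop everything else.
kept_toks <- tokens_keep(toks, c("quick", "fox"))
print(kept_toks)

# The same idea applied to the features of a document-feature matrix.
kept_dfm <- dfm_keep(dfm(toks), c("quick", "fox"))
print(featnames(kept_dfm))
```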

v0.99.12 - Oct 8, 2017

Changes since v0.99.9

New Features

  • Added methods for changing the docnames of tokens and dfm objects (#987).
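
A brief sketch of renaming the documents of a tokens object, using the replacement form of docnames(). The document names and texts are invented for the example.

```r
library(quanteda)

toks <- tokens(c(d1 = "one two", d2 = "three four"))

# Assign new document names, one per document.
docnames(toks) <- c("first", "second")
print(docnames(toks))
```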

Bug fixes and stability enhancements

  • The computation of tfidf has been more thoroughly described in the documentation for this function (#997).
  • Now depends on R >= 3.4.0, to avoid showing errors in r-oldrelease builds.
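
For readers unfamiliar with tf-idf, the arithmetic can be worked through in base R. The sketch below uses one common definition, the raw count multiplied by the base-10 log of the inverse document frequency. Whether this matches the package's default scheme is an assumption to check against the function's own documentation; the toy matrix is invented.

```r
# Toy document-feature counts: rows are documents, columns are features.
counts <- matrix(c(2, 0, 1,
                   1, 1, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("d1", "d2"), c("a", "b", "c")))

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)      # number of documents containing each feature
idf      <- log10(n_docs / doc_freq) # inverse document frequency, base 10

# Multiply every column of counts by that feature's idf value.
tfidf <- sweep(counts, 2, idf, `*`)
print(tfidf)
```

Feature "a" occurs in every document, so its idf is zero and it is weighted down to zero everywhere, while "b" and "c" keep a positive weight where they occur.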

v0.99.9 - Sep 23, 2017

Changes since v0.99

New Features

  • Added magrittr pipe support (#927). %>% can now be used with quanteda without needing to attach magrittr (or, as many users apparently believe, the entire tidyverse).
  • corpus_segment() now behaves more logically and flexibly, and is clearly differentiated from corpus_reshape() in terms of its functionality. Its documentation is also vastly improved. (#908)
  • Added data_dictionary_LSD2015, the Lexicoder Sentiment 2015 dictionary (#963).
  • Significant improvements to the performance of tokens_lookup() and dfm_lookup() (#960).
  • New functions head.corpus(), tail.corpus() provide fast subsetting of the first or last documents in a corpus. (#952)
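
Two of the features above can be combined in a short sketch: taking the first few documents of a corpus with head(), then applying the Lexicoder Sentiment dictionary through tokens_lookup(). It assumes the built-in data_corpus_inaugural corpus is available in the installed version.

```r
library(quanteda)

# First three documents of the built-in inaugural address corpus.
first3 <- head(data_corpus_inaugural, 3)

# Replace matched tokens with dictionary keys; unmatched tokens are dropped by default.
toks <- tokens(first3)
sent <- tokens_lookup(toks, dictionary = data_dictionary_LSD2015)

# Tabulate the sentiment categories per document.
sent_dfm <- dfm(sent)
print(sent_dfm)
```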

Bug fixes and stability enhancements

  • Fixed a problem when applying purrr::map() to dfm() (#928).
  • Added documentation for regex2fixed() and associated functions.
  • Fixed a bug in textstat_collocations.tokens() caused by "documents" containing only "" as tokens. (#940)
  • Fixed a bug caused by cbind.dfm() when features shared a name starting with quanteda_options("base_featname") (#946)
  • Improved dictionary handling and creation, which now correctly handle nested LIWC 2015 categories. (#941)
  • Number of threads now set correctly by quanteda_options(). (#966)
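
Setting and then reading back the thread count looks roughly like this. The value 2 is arbitrary, and the effective number may be capped by what the machine and build support.

```r
library(quanteda)

# Request two threads for quanteda's multi-threaded routines.
quanteda_options(threads = 2)

# Read the current setting back.
n_threads <- quanteda_options("threads")
print(n_threads)
```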

Behaviour changes

  • summary.corpus() now generates a special data.frame, which has its own print method, rather than requiring verbose = FALSE to suppress output (#926).
  • textstat_collocations() is now multi-threaded.
  • head.dfm(), tail.dfm() now behave consistently with base R methods for matrix, with the added argument nfeature. Previously, these methods printed the subset and invisibly returned it. Now, they simply return the subset. (#952)

v0.99 - Aug 16, 2017

New features

  • Improvements and consolidation of methods for detecting multi-word expressions, now active only through textstat_collocations(), which computes only the lambda method for now, but does so accurately and efficiently. (#753, #803). This function is still under development and likely to change further.
  • Added new quanteda_options that affect the maximum documents and features displayed by the dfm print method (#756).
  • ngram formation is now significantly faster, including with skips (skipgrams).
  • Improvements to topfeatures():
    • now accepts a groups argument that can be used to generate lists of top (or bottom) features in a group of texts, including by document (#336).
    • new argument scheme that takes the default of (frequency) "count" but also a new "docfreq" value (#408).
  • New wrapper phrase() converts whitespace-separated multi-word patterns into a list of patterns. This affects the feature/pattern matching in tokens/dfm_select/remove, tokens_compound, tokens/dfm_lookup, and kwic. phrase() and the associated changes also make the behaviour of using character vectors, lists of characters, dictionaries, and collocation objects for pattern matches far more consistent. (See #820, #787, #740, #837, #836, #838)
  • corpus.Corpus() for creating a corpus from a tm Corpus now works with more complex objects that include document-level variables, such as data from the manifestoR package (#849).
  • New plot function textplot_keyness() plots term "keyness", the association of words with contrasting classes as measured by textstat_keyness().
  • Added corpus constructor for corpus objects (#690).
  • Added dictionary constructor for dictionary objects (#690).
  • Added a tokens constructor for tokens objects (#690), including updates to tokens() that improve the consistency and efficiency of the tokenization.
  • Added new quanteda_options(): language_stemmer and language_stopwords, now used as the defaults in the *_wordstem functions and stopwords(), respectively. Also uses this option in dfm() when stem = TRUE, rather than hard-wiring in the "english" stemmer (#386).
  • Added a new function textstat_frequency() to compile feature frequencies, possibly by groups. (#825)
  • Added nomatch option to tokens_lookup() and dfm_lookup(), to provide tokens or feature counts for categories not matched to any dictionary key. (#496)
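
The role of phrase() is easiest to see with a small example. The sketch below marks a whitespace-separated pattern as a multi-word sequence before compounding; the sentence is invented, and the underscore joiner is the assumed default concatenator.

```r
library(quanteda)

toks <- tokens("The United States is large")

# phrase() splits "United States" into a two-token sequence to match,
# instead of a single pattern that contains a space.
comp <- tokens_compound(toks, pattern = phrase("United States"))
print(comp)
```

The two matched tokens should come back joined as a single token, United_States.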

Behaviour changes

  • The functions sequences() and collocations() have been removed and replaced by textstat_collocations().
  • (Finally) we added "will" to the list of English stopwords (#818).
  • dfm objects with one or both dimensions having zero length, and empty kwic objects now display more appropriately in their print methods (per #811).
  • Pattern matches are now implemented more consistently across functions. In functions such as *_select, *_remove, tokens_compound, features has been replaced by pattern, and in kwic, keywords has been replaced by pattern. These all behave consistently with respect to pattern, which now has a unified single help page and parameter description. (#839) See also above new features related to phrase().
  • We have improved the performance of the C++ routines that handle many of the tokens_* functions using hashed tokens, making some of them 10x faster (#853).
  • Upgrades to the dfm_group() function now allow "empty" documents to be created using the fill = TRUE option, for making documents conform to a selection (similar to how dfm_select() works for features, when supplied a dfm as the pattern argument). The groups argument now behaves consistently across the functions where it is used. (#854)
  • dictionary() now requires its main argument to be a list, not a series of elements that can be used to build a list.
  • Some changes to the behaviour of tokens() have improved the behaviour of remove_hyphens = FALSE, which now behaves more correctly regardless of the setting of remove_punct (#887).
  • Improved cbind.dfm() function allows cbinding vectors, matrices, and (recyclable) scalars to dfm objects.
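
The list requirement for dictionary() and the nomatch option of tokens_lookup() can be sketched together. The dictionary keys, values, and the label "other" are all invented for the example.

```r
library(quanteda)

# The main argument is a single named list of keys and their values.
dict <- dictionary(list(fruit = c("apple", "banana"),
                        vegetable = "carrot"))

toks <- tokens("apple carrot bread")

# Tokens matching no key are labelled "other" instead of being dropped.
res <- tokens_lookup(toks, dictionary = dict, nomatch = "other")
print(res)
```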

Bug fixes and stability enhancements

  • For the underlying methods behind textstat_collocations(), we corrected the word matching, and lambda and z calculation methods, which were slightly incorrect before. We also removed the chi2, G2, and pmi statistics, because these were incorrectly calculated for size > 2.
  • LIWC-formatted dictionary import is now robust to terms assigned to missing categories.
  • textmodel_NB(x, y, distribution = "Bernoulli") was previously inactive even when this option was set. It has now been fully implemented and tested (#776, #780).
  • Separators including rare spacing characters are now handled more robustly by the remove_separators argument in tokens(). See #796.
  • Improved memory usage when computing ntoken() and ntype(). (#795)
  • quanteda_options() no longer throws an error when quanteda functions are called directly without attaching the package. In addition, quanteda options can now be set in .Rprofile and will not be overwritten when the options initialization takes place when attaching the package.
  • Fixed a bug in textstat_readability() that wrongly computed the number of words with fewer than 3 syllables in a text; this affected the FOG.NRI and the Linsear.Write measures only.
  • Fixed mistakes in the computation of two docfreq schemes: "logave" and "inverseprob".
  • Fixed a bug in the handling of multi-thread options where the settings using quanteda_options() did not actually set the number of threads. In addition, we fixed a bug that turned threading off on macOS (due to a check for a gcc version that is not used for compiling the macOS binaries), which prevented multi-threading from being used at all on that platform.
  • Fixed a bug causing failure when functions that use quanteda_options() are called without the namespace or package being attached or loaded (#864).
  • Fixed a bug in overloading the View method that caused all named objects in the RStudio/Source pane to be named "x". (#893)

v0.9.9.65 - May 28, 2017

Changes since v0.9.9-50

New features

  • Corpus construction using corpus() now works for a tm::SimpleCorpus object. (#680)
  • Added corpus_trim() and char_trim() functions for selecting documents or subsets of documents based on sentence, paragraph, or document lengths.
  • Conversion of a dfm to an stm object now passes docvars through in the $meta of the return object.
  • New dfm_group(x, groups = ) command, a convenience wrapper around dfm.dfm(x, groups = ) (#725).
  • Methods for extending quanteda functions to readtext objects updated to match CRAN release of readtext package.
  • Corpus constructor methods for data.frame objects now conform to the "text interchange format" for corpus data.frames, automatically recognizing doc_id and text fields, which also provides interoperability with the readtext package. corpus construction methods are now more explicitly tailored to input object classes.
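
A minimal sketch of two items above: building a corpus from a data.frame that follows the doc_id/text convention, then combining documents with dfm_group(). The data and the group labels "x" and "y" are invented, and the grouping vector is assumed to supply one label per document.

```r
library(quanteda)

# A data.frame in the "text interchange format": doc_id and text columns.
df <- data.frame(doc_id = c("d1", "d2", "d3"),
                 text = c("a b", "a c", "b b"),
                 stringsAsFactors = FALSE)

# doc_id supplies the document names and text supplies the texts.
corp <- corpus(df)
dfmat <- dfm(tokens(corp))

# Sum feature counts within each group: d1 and d2 form "x", d3 forms "y".
grouped <- dfm_group(dfmat, groups = c("x", "x", "y"))
print(grouped)
```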

Bug fixes and stability enhancements

  • dfm_lookup() behaves more robustly on different platforms, especially for keys whose values match no features (#704).
  • textstat_simil() and textstat_dist() no longer take the n argument, as this was not sorting features in correct order.
  • Fixed failure of tokens(x, what = "character") when x included Twitter characters @ and # (#637).
  • Fixed bug #707 where ntype.dfm() produced an incorrect result.
  • Fixed bug #706 affecting textstat_readability() and textstat_lexdiv() for single-document returns when drop = TRUE.
  • Improved the robustness of corpus_reshape().
  • print, head, and tail methods for dfm are more robust (#684).
  • Fixed bug in convert(x, to = "stm") caused by zero-count documents and zero-count features in a dfm (#699, #700, #701). This also removes docvar rows from $meta when this is passed through the dfm, for zero-count documents.
  • Corrected broken handling of nested Yoshikoder dictionaries in dictionary(). (#722)
  • dfm_compress now preserves a dfm's docvars if collapsing only on the features margin, which means that dfm_tolower() and dfm_toupper() no longer remove the docvars.
  • fcm_compress() now retains the fcm class, and generates an error when an asymmetric compression is attempted (#728).
  • textstat_collocations() now returns the collocations as character, not as a factor (#736)
  • Fixed a bug in dfm_lookup(x, exclusive = FALSE) wherein an empty dfm was returned when there was no match (#116).
  • Argument passing through dfm() to tokens() is now robust, and preserves variables defined in the calling environment (#721).
  • Fixed issues related to dictionaries failing when applying str(), names(), or other indexing operations, which started happening on Linux and Windows platforms following the CRAN move to 3.4.0. (#744)
  • Dictionary import using the LIWC format is more robust to improperly formatted input files (#685).
  • Weights applied using dfm_weight() now print friendlier error messages when the weight vector contains features not found in the dfm. See this Stack Overflow question for the use case that sparked this improvement.