No articles match
Introduction to data.table5 days ago
Data analysis using data.table | Data | Introduction | 1. Basics | a) What is data.table? | Note that: | b) General form - in what way is a data.table enhanced? | The way to read it (out loud) is: | c) Subset rows in i | -- Get all the flights with "JFK" as the origin airport in the month of June. | -- Get the first two rows from flights. | -- Sort flights first by column origin in ascending order, and then by dest in descending order: | order() is internally optimised | d) Select column(s) in j | -- Select arr_delay column, but return it as a vector. | -- Select arr_delay column, but return as a data.table instead. | Tip: | -- Select both arr_delay and dep_delay columns. | -- Select both arr_delay and dep_delay columns and rename them to delay_arr and delay_dep. | e) Compute or do in j | -- How many trips have had total delay < 0? | What's happening here? | f) Subset in i and do in j | -- Calculate the average arrival and departure delay for all flights with "JFK" as the origin airport in the month of June. | -- How many trips have been made in 2014 from "JFK" airport in the month of June? | g) Handle non-existing elements in i | -- What happens when querying for non-existing elements? | Special symbol .N: | h) Great! But how can I refer to columns by names in j (like in a data.frame)? | -- Select both arr_delay and dep_delay columns the data.frame way. | -- Select columns named in a variable using the .. prefix | -- Select columns named in a variable using with = FALSE | 2. Aggregations | a) Grouping using by | -- How can we get the number of trips corresponding to each origin airport? | -- How can we calculate the number of trips for each origin airport for carrier code "AA"? | -- How can we get the total number of trips for each origin, dest pair for carrier code "AA"? | -- How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code "AA"? | b) Sorted by: keyby | -- So how can we directly order by all the grouping variables? | c) Chaining | -- How can we order ans using the columns origin in ascending order, and dest in descending order? | d) Expressions in by | -- Can by accept expressions as well or does it just take columns? | e) Multiple columns in j - .SD | -- Do we have to compute mean() for each column individually? | Special symbol .SD: | -- How can we specify just the columns we would like to compute the mean() on? | .SDcols | f) Subset .SD for each group: | -- How can we return the first two rows for each month? | g) Why keep j so flexible? | -- How can we concatenate columns a and b for each group in ID? | -- What if we would like to have all the values of column a and b concatenated, but returned as a list column? | Summary | Using i: | Using j: | Using by: | And remember the tip:
Reference semantics12 days ago
Data | Introduction | 1. Reference semantics | a) Background | shallow vs deep copy | b) The := operator | 2. Add/update/delete columns by reference | a) Add columns by reference | -- How can we add columns speed and total delay of each flight to flights data.table? | Note that | b) Update some rows of columns by reference - sub-assign by reference | -- Replace those rows where hour == 24 with the value 0 | Exercise: | c) Delete column by reference | -- Remove delay column | d) := along with grouping using by | -- How can we add a new column which contains for each orig,dest pair the maximum speed? | Note on zero-length RHS and by | e) Multiple columns and := | -- How can we add two more columns computing max() of dep_delay and arr_delay for each month, using .SD? | -- How can we update multiple existing columns in place using .SD? | 3. := and copy() | a) := for its side effect | b) The copy() function | c) Selecting columns: $ / [[...]] vs [, col] | Summary | The := operator
Importing data.table18 days ago
Why to import data.table | Importing data.table is easy | DESCRIPTION file | NAMESPACE file | Usage | Testing | Testing using testthat | Dealing with "undefined global functions or variables" | Care needed when providing and using options | Troubleshooting | License | Optionally import data.table: Suggests | data.table in Imports but nothing imported | Further information on dependencies | Importing data.table C routines | How to convert your Depends dependency on data.table to Imports | Step 0. Ensure your package is passing R CMD check initially | Step 1. Update the DESCRIPTION file to put data.table in Imports, not Depends | Step 2.1: Run R CMD check | Step 2.2: Modify the NAMESPACE file | Blanket import | Step 3: Update Your R code files outside the package's R/ directory | Benefits of using Imports
Fast Read and Fast Write5 months ago
1. fread() | 1.1 Using command line tools directly | 1.1.1 Reading directly from a text string | 1.1.2 Reading from URLs | 1.1.3 Automatic decompression of compressed files | 1.2 Automatic separator and skip detection | 1.3 High-Quality Automatic Column Type Detection | 1.4 Early Error Detection at End-of-File | 1.5 integer64 Support | 1.6 Drop or Select Columns by Name or Position | 1.7 Automatic Quote Escape Detection (Including No-Escape) | 2. fwrite() | 2.1 Intelligent and Minimalist Quoting (quote="auto") | 2.2 Fine-Grained Date/Time Serialization (dateTimeAs argument) | 2.3 Handling of bit64::integer64 | 2.4 Column Order and Subset Control | 3. A Note on Performance
Joins in data.table5 months ago
1. Defining example data | 2. data.table joining syntax | 3. Equi joins | 3.1. Right join | 3.1.1. Joining by a list argument | 3.1.2. Alternatives to define the on argument | 3.1.3. Operations after joining | Managing shared column Names with the j argument | Summarizing with on in data.table | 3.1.4. Joining based on several columns | 3.2. Inner join | 3.3. Anti-join | 3.4. Semi join | 3.5. Left join | 3.5.1. Joining after chain operations | 3.6. Many to many join | 3.6.1. Selecting one match | 3.6.2. Cross join | 3.7. Full join | 4. Non-equi join | 4.1 Output column names in non-equi joins | 5. Rolling join | 6. Taking advantage of joining speed | 6.1. Subsets as joins | 6.2. Updating by reference | Reference
Keys and fast binary search based subset6 months ago
Data | Introduction | 1. Keys | a) What is a key? | Keys and their properties | b) Set, get and use keys on a data.table | -- How can we set the column origin as key in the data.table flights? | set* and :=: | -- Use the key column origin to subset all rows where the origin airport matches "JFK" | -- How can we get the column(s) a data.table is keyed by? | c) Keys and multiple columns | -- How can I set keys on both origin and dest columns? | -- Subset all rows using key columns where first key column origin matches "JFK" and second key column dest matches "MIA" | How does the subset work here? | -- Subset all rows where just the first key column origin matches "JFK" | -- Subset all rows where just the second key column dest matches "MIA" | What's happening here? | 2. Combining keys with j and by | a) Select in j | -- Return arr_delay column as a data.table corresponding to origin = "LGA" and dest = "TPA". | b) Chaining | -- On the result obtained above, use chaining to order the column in decreasing order. | c) Compute or do in j | -- Find the maximum arrival delay corresponding to origin = "LGA" and dest = "TPA". | d) sub-assign by reference using := in j | e) Aggregation using by | -- Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month | 3. Additional arguments - mult and nomatch | a) The mult argument | -- Subset only the first matching row from all rows where origin matches "JFK" and dest matches "MIA" | -- Subset only the last matching row of all the rows where origin matches "LGA", "JFK", "EWR" and dest matches "XNA" | b) The nomatch argument | -- From the previous example, Subset all rows only if there's a match | 4. binary search vs vector scans | a) Performance of binary search approach | b) Why does keying a data.table result in blazing fast subsets? | Vector scan approach | Binary search approach | Summary
Programming on data.table6 months ago
Introduction | Problem description | Example | Approaches to the problem | Avoid lazy evaluation | Use of parse / eval | Computing on the language | Use third party packages | Substituting variables and names | Substitute functions | Substitute variables and character values | Substituting lists of arbitrary length | Substitution of a complex query | Common mistakes | Use env argument from inside another function | Retired interfaces | get | mget | eval
Secondary indices and auto indexing6 months ago
Data | Introduction | 1. Secondary indices | a) What are secondary indices? | Keyed vs. Indexed Subsetting | b) Set and get secondary indices | -- How can we set the column origin as a secondary index in the data.table flights? | -- How can we get all the secondary indices set so far in flights? | c) Why do we need secondary indices? | -- Reordering a data.table can be expensive and not always ideal | setkey() requires: | -- There can be only one key at the most | -- Secondary indices can be reused | -- The new on argument allows for cleaner syntax and automatic creation and reuse of secondary indices | on argument | 2. Fast subsetting using on argument and secondary indices | a) Fast subsets in i | -- Subset all rows where the origin airport matches "JFK" using on | -- How can I subset based on origin and dest columns? | b) Select in j | -- Return arr_delay column alone as a data.table corresponding to origin = "LGA" and dest = "TPA" | c) Chaining | -- On the result obtained above, use chaining to order the column in decreasing order. | d) Compute or do in j | -- Find the maximum arrival delay corresponding to origin = "LGA" and dest = "TPA". | e) sub-assign by reference using := in j | f) Aggregation using by | -- Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month | g) The mult argument | -- Subset only the first matching row where dest matches "BOS" and "DAY" | -- Subset only the last matching row where origin matches "LGA", "JFK", "EWR" and dest matches "XNA" | h) The nomatch argument | -- From the previous example, subset all rows only if there's a match | 3. Auto indexing
Benchmarking data.table6 months ago
fread: clear caches | subset: threshold for index optimization on compound queries | subset: index aware benchmarking | by reference operations | try to benchmark atomic processes | avoid class coercion | avoid microbenchmark(..., times=100) | multithreaded processing | inside a loop prefer set instead of := | inside a loop prefer setDT instead of data.table()
Efficient reshaping using data.tables6 months ago
Data | Introduction | 1. Default functionality | a) melting data.tables (wide to long) | - Convert DT to long form where each dob is a separate observation. | - Name the variable and value columns to child and dob respectively | b) dcasting data.tables (long to wide) | - How can we get back to the original data table DT from DT.m1? | - Starting from DT.m1, how can we get the number of children in each family? | 2. Limitations in previous melt/dcast approaches | Issues | 3. Enhanced (new) functionality | a) Enhanced melt | - melt multiple columns simultaneously | - Using patterns() | - Using measure() to specify measure.vars via separator or pattern | b) Enhanced dcast | - Casting multiple value.vars simultaneously | Multiple functions to fun.aggregate:
Frequently Asked Questions about data.table6 months ago
Beginner FAQs | Why do DT[ , 5] and DT[2, 5] return a 1-column data.table rather than vectors like data.frame? | Why does DT[,"region"] return a 1-column data.table rather than a vector? | Why does DT[, region] return a vector for the "region" column? I'd like a 1-column data.table. | Why does DT[ , x, y, z] not work? I wanted the 3 columns x,y and z. | I assigned a variable mycol="x" but then DT[, mycol] returns an error. How do I get it to look up the column name contained in the mycol variable? | What are the benefits of being able to use column names as if they are variables inside DT[...]? | OK, I'm starting to see what data.table is about, but why didn't you just enhance data.frame in R? Why does it have to be a new package? | Why are the defaults the way they are? Why does it work the way it does? | Isn't this already done by with() and subset() in base? | Why does X[Y] return all the columns from Y too? Shouldn't it return a subset of X? | What is the difference between X[Y] and merge(X, Y)? | Anything else about X[Y, sum(foo*bar)]? | That's nice. How did you manage to change it given that users depended on the old behaviour? | General Syntax | How can I avoid writing a really long j expression? You've said that I should use the column names, but I've got a lot of columns. | Why is the default for mult now "all"? | I'm using c() in j and getting strange results. | I have built up a complex table with many columns. I want to use it as a template for a new table; i.e., create a new table with no rows, but with the column names and types copied from my table. Can I do that easily? | Is a null data.table the same as DT[0]? | Why has the DT() alias been removed? | But my code uses j = DT(...) and it works. The previous FAQ says that DT() has been removed. | What are the scoping rules for j expressions? | Can I trace the j expression as it runs through the groups? | Inside each group, why are the group variables length-1? | Only the first 10 rows are printed, how do I print more? | With an X[Y] join, what if X contains a column called "Y"? | X[Z[Y]] is failing because X contains a column "Y". I'd like it to use the table Y in calling scope. | Can you explain further why data.table is inspired by A[B] syntax in base? | Can base be changed to do this then, rather than a new package? | I've heard that data.table syntax is analogous to SQL. | What are the smaller syntax differences between data.frame and data.table | I'm using j for its side effect only, but I'm still getting data returned. How do I stop that? | Why does [.data.table now have a drop argument from v1.5? | Rolling joins are cool and very fast! Was that hard to program? | Why does DT[i, col := value] return the whole of DT? I expected either no visible value (consistent with <-), or a message or return value containing how many rows were updated. It isn't obvious that the data has indeed been updated by reference. | OK, thanks. What was so difficult about the result of DT[i, col := value] being returned invisibly? | Why do I have to type DT sometimes twice after using := to print the result to console? | I've noticed that base::cbind.data.frame (and base::rbind.data.frame) appear to be changed by data.table. How is this possible? Why? | I've read about method dispatch (e.g. merge may or may not dispatch to merge.data.table) but how does R know how to dispatch? Are dots significant or special? How on earth does R know which function to dispatch and when? | Why do T and F behave differently from TRUE and FALSE in some data.table queries? | Questions relating to compute time | I have 20 columns and a large number of rows. Why is an expression of one column so quick? | I don't have a key on a large table, but grouping is still really quick. Why is that? | Why is grouping by columns in the key faster than an ad hoc by? | What are primary and secondary indexes in data.table? | Error messages | "Could not find function DT" | "unused argument(s) (MySum = sum(v))" | "translateCharUTF8 must be called on a CHARSXP" | cbind(DT, DF) returns a strange format, e.g. Integer,5 | "cannot change value of locked binding for .SD" | "cannot change value of locked binding for .N" | Warning messages | "The following object(s) are masked from package:base: cbind, rbind" | "Coerced numeric RHS to integer to match the column's type" | Reading data.table from RDS or RData file | General questions about the package | v1.3 appears to be missing from the CRAN archive? | Is data.table compatible with S-plus? | Is it available for Linux, Mac and Windows? | I think it's great. What can I do? | I think it's not great. How do I warn others about my experience? | I have a question. I know the r-help posting guide tells me to contact the maintainer (not r-help), but is there a larger group of people I can ask? | Where are the datatable-help archives? | I'd prefer not to post on the Issues page, can I mail just one or two people privately? | I have created a package that uses data.table. How do I ensure my package is data.table-aware so that inheritance from data.frame works?
Using .SD for Data Analysis6 months ago
What is .SD? | Loading and Previewing Lahman Data | .SD on Ungrouped Data | Column Subsetting: .SDcols | Column Type Conversion | Controlling a Model's Right-Hand Side | Conditional Joins | Grouped .SD operations | Group Subsetting | Group Optima | Grouped Regression
How to Datapasta6 years ago
Typical Usage with Rstudio | Pasting a table as a formatted tibble definition with tribble_paste() | Pasting a list as a horizontal vector with vector_paste() | Pasting a list as a vertical vector with vector_paste_vertical() | Outputting data from your R environment | Output to R with dpasta() | Avoiding fiddly data formatting | Fiddle Selections until they're better | Toggle Quotes | Output to clipboard with dmdclip() | Usage without RStudio | Custom Behaviour for Your Unique Snowflake Setup | Configurable Options | Upping the row guard | Dealing with "," decimal marks
Datapasta in the cloud6 years ago
Fallback 1: Text selection | Fallback 2: Pop-up text editor | Configuration