R-universe - rorylawless (Rory Lawless)

Introduction to data.table (5 days ago)

Data analysis using data.table | Data | Introduction | 1. Basics | a) What is data.table? | Note that: | b) General form - in what way is a data.table enhanced? | The way to read it (out loud) is: | c) Subset rows in i | -- Get all the flights with "JFK" as the origin airport in the month of June. | -- Get the first two rows from flights. | -- Sort flights first by column origin in ascending order, and then by dest in descending order: | order() is internally optimised | d) Select column(s) in j | -- Select arr_delay column, but return it as a vector. | -- Select arr_delay column, but return as a data.table instead. | Tip: | -- Select both arr_delay and dep_delay columns. | -- Select both arr_delay and dep_delay columns and rename them to delay_arr and delay_dep. | e) Compute or do in j | -- How many trips have had total delay < 0? | What's happening here? | f) Subset in i and do in j | -- Calculate the average arrival and departure delay for all flights with "JFK" as the origin airport in the month of June. | -- How many trips have been made in 2014 from "JFK" airport in the month of June? | g) Handle non-existing elements in i | -- What happens when querying for non-existing elements? | Special symbol .N: | h) Great! But how can I refer to columns by names in j (like in a data.frame)? | -- Select both arr_delay and dep_delay columns the data.frame way. | -- Select columns named in a variable using the .. prefix | -- Select columns named in a variable using with = FALSE | 2. Aggregations | a) Grouping using by | -- How can we get the number of trips corresponding to each origin airport? | -- How can we calculate the number of trips for each origin airport for carrier code "AA"? | -- How can we get the total number of trips for each origin, dest pair for carrier code "AA"? | -- How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code "AA"? 
| b) Sorted by: keyby | -- So how can we directly order by all the grouping variables? | c) Chaining | -- How can we order ans using the columns origin in ascending order, and dest in descending order? | d) Expressions in by | -- Can by accept expressions as well or does it just take columns? | e) Multiple columns in j - .SD | -- Do we have to compute mean() for each column individually? | Special symbol .SD: | -- How can we specify just the columns we would like to compute the mean() on? | .SDcols | f) Subset .SD for each group: | -- How can we return the first two rows for each month? | g) Why keep j so flexible? | -- How can we concatenate columns a and b for each group in ID? | -- What if we would like to have all the values of column a and b concatenated, but returned as a list column? | Summary | Using i: | Using j: | Using by: | And remember the tip:

data.table 1.18.99 | Tyson Barrett | datatable-intro.Rmd

Reference semantics (12 days ago)

data.table 1.18.99 | Tyson Barrett | datatable-reference-semantics.Rmd

Importing data.table (18 days ago)

data.table 1.18.99 | Tyson Barrett | datatable-importing.Rmd

Fast Read and Fast Write (5 months ago)

data.table 1.18.99 | Tyson Barrett | datatable-fread-and-fwrite.Rmd

Joins in data.table (5 months ago)

data.table 1.18.99 | Tyson Barrett | datatable-joins.Rmd

Keys and fast binary search based subset (6 months ago)

Data | Introduction | 1. Keys | a) What is a key? | Keys and their properties | b) Set, get and use keys on a data.table | -- How can we set the column origin as key in the data.table flights? | set* and :=: | -- Use the key column origin to subset all rows where the origin airport matches "JFK" | -- How can we get the column(s) a data.table is keyed by? | c) Keys and multiple columns | -- How can I set keys on both origin and dest columns? | -- Subset all rows using key columns where first key column origin matches "JFK" and second key column dest matches "MIA" | How does the subset work here? | -- Subset all rows where just the first key column origin matches "JFK" | -- Subset all rows where just the second key column dest matches "MIA" | What's happening here? | 2. Combining keys with j and by | a) Select in j | -- Return arr_delay column as a data.table corresponding to origin = "LGA" and dest = "TPA". | b) Chaining | -- On the result obtained above, use chaining to order the column in decreasing order. | c) Compute or do in j | -- Find the maximum arrival delay corresponding to origin = "LGA" and dest = "TPA". | d) sub-assign by reference using := in j | e) Aggregation using by | -- Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month | 3. Additional arguments - mult and nomatch | a) The mult argument | -- Subset only the first matching row from all rows where origin matches "JFK" and dest matches "MIA" | -- Subset only the last matching row of all the rows where origin matches "LGA", "JFK", "EWR" and dest matches "XNA" | b) The nomatch argument | -- From the previous example, Subset all rows only if there's a match | 4. binary search vs vector scans | a) Performance of binary search approach | b) Why does keying a data.table result in blazing fast subsets? | Vector scan approach | Binary search approach | Summary

data.table 1.18.99 | Tyson Barrett | datatable-keys-fast-subset.Rmd

Programming on data.table (6 months ago)

data.table 1.18.99 | Tyson Barrett | datatable-programming.Rmd

Secondary indices and auto indexing (6 months ago)

Data | Introduction | 1. Secondary indices | a) What are secondary indices? | Keyed vs. Indexed Subsetting | b) Set and get secondary indices | -- How can we set the column origin as a secondary index in the data.table flights? | -- How can we get all the secondary indices set so far in flights? | c) Why do we need secondary indices? | -- Reordering a data.table can be expensive and not always ideal | setkey() requires: | -- There can be only one key at the most | -- Secondary indices can be reused | -- The new on argument allows for cleaner syntax and automatic creation and reuse of secondary indices | on argument | 2. Fast subsetting using on argument and secondary indices | a) Fast subsets in i | -- Subset all rows where the origin airport matches "JFK" using on | -- How can I subset based on origin and dest columns? | b) Select in j | -- Return arr_delay column alone as a data.table corresponding to origin = "LGA" and dest = "TPA" | c) Chaining | -- On the result obtained above, use chaining to order the column in decreasing order. | d) Compute or do in j | -- Find the maximum arrival delay corresponding to origin = "LGA" and dest = "TPA". | e) sub-assign by reference using := in j | f) Aggregation using by | -- Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month | g) The mult argument | -- Subset only the first matching row where dest matches "BOS" and "DAY" | -- Subset only the last matching row where origin matches "LGA", "JFK", "EWR" and dest matches "XNA" | h) The nomatch argument | -- From the previous example, subset all rows only if there's a match | 3. Auto indexing

data.table 1.18.99 | Tyson Barrett | datatable-secondary-indices-and-auto-indexing.Rmd

Benchmarking data.table (6 months ago)

data.table 1.18.99 | Tyson Barrett | datatable-benchmarking.Rmd

Efficient reshaping using data.tables (6 months ago)

data.table 1.18.99 | Tyson Barrett | datatable-reshape.Rmd

Frequently Asked Questions about data.table (6 months ago)

Beginner FAQs | Why do DT[ , 5] and DT[2, 5] return a 1-column data.table rather than vectors like data.frame? | Why does DT[,"region"] return a 1-column data.table rather than a vector? | Why does DT[, region] return a vector for the "region" column? I'd like a 1-column data.table. | Why does DT[ , x, y, z] not work? I wanted the 3 columns x,y and z. | I assigned a variable mycol="x" but then DT[, mycol] returns an error. How do I get it to look up the column name contained in the mycol variable? | What are the benefits of being able to use column names as if they are variables inside DT[...]? | OK, I'm starting to see what data.table is about, but why didn't you just enhance data.frame in R? Why does it have to be a new package? | Why are the defaults the way they are? Why does it work the way it does? | Isn't this already done by with() and subset() in base? | Why does X[Y] return all the columns from Y too? Shouldn't it return a subset of X? | What is the difference between X[Y] and merge(X, Y)? | Anything else about X[Y, sum(foo*bar)]? | That's nice. How did you manage to change it given that users depended on the old behaviour? | General Syntax | How can I avoid writing a really long j expression? You've said that I should use the column names, but I've got a lot of columns. | Why is the default for mult now "all"? | I'm using c() in j and getting strange results. | I have built up a complex table with many columns. I want to use it as a template for a new table; i.e., create a new table with no rows, but with the column names and types copied from my table. Can I do that easily? | Is a null data.table the same as DT[0]? | Why has the DT() alias been removed? | But my code uses j = DT(...) and it works. The previous FAQ says that DT() has been removed. | What are the scoping rules for j expressions? | Can I trace the j expression as it runs through the groups? | Inside each group, why are the group variables length-1? 
| Only the first 10 rows are printed, how do I print more? | With an X[Y] join, what if X contains a column called "Y"? | X[Z[Y]] is failing because X contains a column "Y". I'd like it to use the table Y in calling scope. | Can you explain further why data.table is inspired by A[B] syntax in base? | Can base be changed to do this then, rather than a new package? | I've heard that data.table syntax is analogous to SQL. | What are the smaller syntax differences between data.frame and data.table | I'm using j for its side effect only, but I'm still getting data returned. How do I stop that? | Why does [.data.table now have a drop argument from v1.5? | Rolling joins are cool and very fast! Was that hard to program? | Why does DT[i, col := value] return the whole of DT? I expected either no visible value (consistent with <-), or a message or return value containing how many rows were updated. It isn't obvious that the data has indeed been updated by reference. | OK, thanks. What was so difficult about the result of DT[i, col := value] being returned invisibly? | Why do I have to type DT sometimes twice after using := to print the result to console? | I've noticed that base::cbind.data.frame (and base::rbind.data.frame) appear to be changed by data.table. How is this possible? Why? | I've read about method dispatch (e.g. merge may or may not dispatch to merge.data.table) but how does R know how to dispatch? Are dots significant or special? How on earth does R know which function to dispatch and when? | Why do T and F behave differently from TRUE and FALSE in some data.table queries? | Questions relating to compute time | I have 20 columns and a large number of rows. Why is an expression of one column so quick? | I don't have a key on a large table, but grouping is still really quick. Why is that? | Why is grouping by columns in the key faster than an ad hoc by? | What are primary and secondary indexes in data.table? 
| Error messages | "Could not find function DT" | "unused argument(s) (MySum = sum(v))" | "translateCharUTF8 must be called on a CHARSXP" | cbind(DT, DF) returns a strange format, e.g. Integer,5 | "cannot change value of locked binding for .SD" | "cannot change value of locked binding for .N" | Warning messages | "The following object(s) are masked from package:base: cbind, rbind" | "Coerced numeric RHS to integer to match the column's type" | Reading data.table from RDS or RData file | General questions about the package | v1.3 appears to be missing from the CRAN archive? | Is data.table compatible with S-plus? | Is it available for Linux, Mac and Windows? | I think it's great. What can I do? | I think it's not great. How do I warn others about my experience? | I have a question. I know the r-help posting guide tells me to contact the maintainer (not r-help), but is there a larger group of people I can ask? | Where are the datatable-help archives? | I'd prefer not to post on the Issues page, can I mail just one or two people privately? | I have created a package that uses data.table. How do I ensure my package is data.table-aware so that inheritance from data.frame works?

data.table 1.18.99 | Tyson Barrett | datatable-faq.Rmd

Using .SD for Data Analysis (6 months ago)

data.table 1.18.99 | Tyson Barrett | datatable-sd-usage.Rmd

How to Datapasta (6 years ago)

datapasta 3.2.2 | Miles McBain | how-to-datapasta.Rmd

Datapasta in the cloud (6 years ago)

Fallback 1: Text selection | Fallback 2: Pop-up text editor | Configuration

datapasta 3.2.2 | Miles McBain | datapasta-in-the-cloud.Rmd