Split Apply Combine Revisited

After discussing my Split-Apply-Combine presentation with an attendee, I decided to take their suggestions and incorporate them into a new online tutorial. I created some graphics that lay out the logical flow of the strategy and dissect the commands, highlighting the inputs, outputs, and various components of each command. I’d like your feedback on whether this format makes the material easier to understand. The full HTML document has been uploaded to our Meetup page as SplitApplyCombineTutorialHTML.html. I’ve included the graphics of the logical flow here, which are a synthesis of the tutorial. They show up much better on the white background within the HTML document.

Split-Apply-Combine Base R Functions

[Graphic: apply and sapply in the split-apply-combine flow]
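For readers without the HTML tutorial at hand, here is a minimal sketch of the pattern this graphic covers. It is not taken from the tutorial itself: the `sales` data frame and its column names are invented for illustration.

```r
# Hypothetical data, invented for this sketch (not from the tutorial)
sales <- data.frame(
  region = c("East", "West", "East", "West", "East"),
  amount = c(10, 20, 30, 40, 20)
)

# Split: break the amounts into one vector per region
pieces <- split(sales$amount, sales$region)

# Apply + combine: sapply() runs mean() on each piece and
# simplifies the results into a named numeric vector
region_means <- sapply(pieces, mean)
print(region_means)

# apply() works over the margins of a matrix: 1 = rows, 2 = columns
m <- matrix(1:6, nrow = 2)
col_sums <- apply(m, 2, sum)
print(col_sums)
```

Writing the split step out with split() makes the three stages visible; the functions in the next sections fold the split into the call itself.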

Base R Functions: Tapply & Aggregate

[Graphic: tapply and aggregate in the split-apply-combine flow]
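As a rough sketch of how these two functions do the split, apply, and combine steps in a single call, assuming the same kind of made-up `sales` data (the names are hypothetical, not from the tutorial):

```r
# Hypothetical data, invented for this sketch
sales <- data.frame(
  region = c("East", "West", "East", "West", "East"),
  amount = c(10, 20, 30, 40, 20)
)

# tapply(): values, grouping vector, function.
# Returns a named array with one entry per group.
by_region <- tapply(sales$amount, sales$region, mean)
print(by_region)

# aggregate(): formula interface, returns a data frame
# with one row per group
agg <- aggregate(amount ~ region, data = sales, FUN = mean)
print(agg)
```

The main practical difference is the shape of the result: tapply() gives back a named array, while aggregate() gives back a data frame that is easy to merge or plot.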

PLYR Package: ddply

[Graphic: ddply in the split-apply-combine flow]
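A small sketch of the same idea with ddply, assuming the plyr package is installed. The `sales` data and the output column names `mean_amount` and `n` are invented for illustration.

```r
library(plyr)  # assumes plyr has been installed from CRAN

# Hypothetical data, invented for this sketch
sales <- data.frame(
  region = c("East", "West", "East", "West", "East"),
  amount = c(10, 20, 30, 40, 20)
)

# ddply(): data frame in, data frame out.
# Split by "region", apply summarise() to each piece,
# and combine the per-group rows into one data frame.
result <- ddply(sales, "region", summarise,
                mean_amount = mean(amount),
                n = length(amount))
print(result)
```

Because the input and output are both data frames, several summary columns can be computed in one pass.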


Split Apply & Combine Tutorial

Thanks to all who attended this session. I’ve included some links that were discussed during the presentation, along with a pointer to the .R file that was uploaded to the Meetup page.

R File

Located in the Scottsdale BI Files section. Look for the file uploaded on Dec 16, 2014 called splitApplyCombineTutorialFinal. If you click on the filename, it will download automatically.

Pertinent Links to Resources/White Papers

Hadley Wickham’s writeup, “The Split-Apply-Combine Strategy for Data Analysis”

Sean Anderson’s writeup: “A Quick Introduction to plyr”

PLYR Documentation

CRAN Task View webpages: This is a very useful guide that will help you uncover solutions for common R Programming problems. It will point you to the packages and functions useful for certain disciplines and methodologies.

Discussion of Future Meetups

We really need your feedback and ideas to make our Meetup successful.

R Workshop: Venue, Time, Topics
  • Venue: We need a place that has proper plugins for laptops. Possible locations: ASU on weekends, ASU Skysong, or possibly Public Library.
    • Public libraries often have limitations on length of meeting, but some computer-related Meetups are held at Burton Barr (Phoenix main library)
    • Parking at ASU is only free on Sundays.
    • ASU Skysong is a very nice facility with the right tech and parking amenities. There are costs associated with its use.
  • Time: Weekends seem to work better than weeknights.
  • Possible Topics:
    • Machine Learning
    • Regression Modelling
    • Cleaning Data
    • Data Visualization
    • Data Presentation (Knitr, R Markdown, LaTeX, Shiny)
    • The tidyr or reshape packages, which are clean, powerful packages for data cleansing
    • R & Python
  • Presentation Format: We got some great feedback on how to create a better tutorial. Attendees commented that they would like to see:
    • Case Studies that reflect real-world problems.
    • Depth of the subject is preferred over breadth.
    • Go into more depth on each function, explaining how it ties back to the documentation, what each of the inputs is, and what the outputs are.
    • Point out common gotchas or errors that you ran into while creating the scripts (let other people learn from the mistakes).
    • Provide a list of skills/homework that will be useful to know prior to the tutorial/presentation

Other Business Analytics Topics:

  • Hadoop
  • R & Python
  • Pandas: Python Library

Announcement of Related Meetup of Interest

The Phoenix Biomedical Informatics Group will be sponsoring a presentation on applying data mining techniques to image data in the medical industry, entitled “Processing of Medical Images Using Machine Learning Techniques”. It will be held at ASU’s Skysong on January 13th.


And the winner is…

When we finished the R Bootcamp we promised that if you offered feedback you would be entered into a drawing for your choice of the box of goodies we were expecting from Revolution Analytics. The box arrived, and now we have a winner: Rahul Garkhail.

For those that are interested in the details, I did the drawing in R. I created a list of the people that offered feedback in a variable called “drawing.” I then found the length of drawing and selected a random element of the list by using the sample command.

length(drawing) returned 19

x = sample(1:19, 1) returned a random value between 1 and 19 inclusive. (I ran several tests to make sure it included the bounding numbers.)

I then returned drawing[x] to determine the winner’s name.
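The steps above can be sketched roughly as follows. The entrant names are placeholders (the real list had 19 people), and the set.seed() call is an addition for reproducibility, not something the original drawing is known to have used.

```r
# Placeholder names; the real "drawing" list held 19 entrants
drawing <- c("Entrant A", "Entrant B", "Entrant C", "Entrant D")

# Optional addition: fix the seed so the draw can be repeated
set.seed(2014)

# Pick one random position between 1 and length(drawing), inclusive
x <- sample(seq_along(drawing), 1)

# Index the list of names (not the sample function) to get the winner
winner <- drawing[x]
print(winner)
```

Using seq_along(drawing) instead of a hard-coded 1:19 keeps the draw correct if the number of entrants changes. A shorter equivalent for a vector of names is sample(drawing, 1).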

Congratulations, Rahul!

What topics would you like to see presented?

Now that we’ve had the R Bootcamp we are moving into the second phase of world domination: regularly scheduled workshops. Some part of each workshop will definitely involve questions so we can help each other out, but it would also be nice to schedule more in-depth topics so we can get a greater understanding of how to use R. Please let us know what topics interest you through the poll below. You can also add answers.

Popular Feedback Requests an R Case Study

About 20 people filled out feedback forms at the end of the R Bootcamp. We asked people to let us know what topics they’d like to see presentations on. I was thinking we would get answers like “ggplot” or “working with data frames” or “regression models.” Instead, the most popular request was for case studies to see how people are utilizing R in their workplace.

This prompted me to ask on one of the R groups on LinkedIn whether anyone in the Phoenix area would step forward and present their experiences to us. I’ll also put that request here. If you are using R at work and would like to share with us what you are doing, please let us know.

In the meantime, I’m going to keep asking around, so I may post the question on a few other R groups on LinkedIn, ping our friends at Revolution Analytics, or reach out to the Los Angeles R User Group.

You learn something new every time you hold a Bootcamp and ask for feedback. 🙂

Wow! That was awesome!

We had over 30 people gathered together on a Saturday morning to go over the basics of the R language. It went well and the feedback was very positive. We may have to make this an annual event. (We’ll try to post some pictures.)

We spent just under four hours introducing people to the basics of R data types, popular and useful commands in R, reading data into R, getting to know your dataset, merging a couple of datasets, and visualizing data with everything from base plotting to lattice to ggplot2 to mapping. We topped it off by getting introduced to Shiny.

We were sponsored by Revolution Analytics (we’ll be putting something official on the website soon) and we promised those who filled out feedback forms at the end that they’d be entered into a drawing for first pick of the goodies that we were expecting in the mail from Revolution Analytics. We got the box in the mail today, and the lucky winner will have a choice of a t-shirt (two styles) or a monkey with a cape. I think he might be named Chebyshev. The monkey, not the winner.

I’m hoping soon we can get a poll on the website that will allow people to vote for topics they’d like covered. We are going to start doing regular R workshops soon, too. Our R family is growing. If you are reading this and you are interested in either attending or presenting, please fill out the contact form and let us know.

Until next time, see you lateR!


R Bootcamp!

It is final! We will be having an R Bootcamp on August 2nd from 8:30 to 1:30. Come learn the basics of how to get data into R, perform some basic manipulations, and create basic plots and graphs. Learn how to use R Studio, and prepare yourself for our continuing series of workshops where you can continue to improve your skills.

The R Bootcamp will take place at the University of Advancing Technology at 2625 W Baseline Rd, Tempe, AZ. It is sponsored by Revolution Analytics, the sponsor of our R group and provider of R-related products and services.

R Bootcamp Planning Meeting

Eight R enthusiasts met on July 1st at Paradise Bakery to discuss how to run the R Bootcamp and establish presenters. The goal of the R Bootcamp is to get people up to speed on basic R skills so that there is a shared base of knowledge, members can learn from each other, and everyone gets something out of ongoing presentations.

There are some existing R bootcamps whose material we can plagiarize ****errrr**** I mean borrow from. LA R Group – very large group. Also one in the Bay Area.

When Will the R Bootcamp Be:

We are aiming for a half-day session on August 2nd.

Many R Bootcamps are 2 days long, though we found some examples that were just a few hours. We decided on a half day session as a good introduction to R that wouldn’t exhaust or discourage anyone from attending.

Possible Material:

SWIRL: this is an R package from Johns Hopkins that provides an R Tutorial

Google Drive Dataset – Medicare Hospital Data Set which can be used for programming assignments.

Presenters and Topics:

This agenda is still a work in progress and may change based on feedback from the Meetup on July 15.

  1. Introduction (30 minutes) – Bill – This will be a quick check to make sure everyone has R and R Studio installed correctly and working and show people how to get help within R.
  2. Getting Data Into R (1 hour) – Ram and Marco – This will provide the basics on reading data into R
  3. Working with Data (1 hour) – Bill – This will show people how to convert, subset, split, and merge datasets
  4. Graphics and Visualization (1 hour) – Chris and Belinda – This will show people how to create graphs and plots in R and introduce them to ggplot2
  5. Where To Go Next (30 minutes) – Lisa – This will show people what to do when they leave. It will introduce SWIRL, Coursera courses, etc.


UAT or the Hive at Burton Barr Public Library

Update: The Hive can only be reserved for 3 hour blocks, so we are working on reserving a classroom at UAT.


There will be another meeting on July 15th to go over presentation material.

Planning an R Bootcamp

Exciting news!!! We are going to hold an R Bootcamp, and you can be a part of planning it.

Sure, you can pay hundreds of dollars to attend an R Bootcamp, but here you can actually be part of presenting it. We’ll be meeting on July 1 at 6:00 at the Paradise Bakery at 1825 E Guadalupe Rd in Tempe, AZ to plan it out. Even if you aren’t interested in presenting, feel free to join us to give your input on what you would like to see covered. We’ll likely be picking a date, selecting topics, and dividing up the presentations.

This is exciting! Hope to see you there!

Working Group Topics

R Studio Working Environment

The collaborative working group will likely be composed of those with minimal to moderate R programming skills. As such, the first few weeks will be spent covering basic topics, and we will then move on to discussions of how to leverage these tools to clean and examine data sets.

The following is a list of topics we plan to cover. Please feel free to indicate if there are additional topics you are interested in.

  1. The basics: Installing R and R Studio, how to use GitHub
  2. Data Manipulation: data types, subsetting, and list and file concatenation
  3. Examining the most commonly used data packages
  4. Creating functions and programs
  5. Working with data sources (files, XML, JSON, HTML)
  6. Statistical analysis techniques and how/when they are used
  7. Data visualization: graphing and plotting
  8. Documentation: R Markdown and sharing results
  9. Text analysis and data scraping
  10. Regular Expressions: data parsing techniques
  11. Machine learning, clustering, and data mining

We discussed how often we should meet: bimonthly or monthly. Additional input is needed from the group to determine the appropriate level. We also need to determine if there are some in the group that are willing to present (advantage – the best way to learn is to teach).