Passing variable names to mapply (using reshape) - r

I'm trying to take a long-format dataframe and create several wide-format dataframes from it according to a list of different variables.
My thought is to use mapply to pass the set of variables I want to filter by positionally to the dataset. But it doesn't look like mapply can read in the list of vars.
Data:
library(dplyr)
library(reshape2)
set.seed(1234)
data <- data.frame(
region = sample(c("northeast","midwest","west"), 40, replace = TRUE),
date = rep(seq(as.Date("2010-02-01"), length=4, by = "1 day"),10),
employed = sample(50000:100000, 40, replace = T),
girls = sample(1:40),
guys = sample(1:40)
)
For each of the quantitative variables (employed, girls, and guys), I want to create a wide-format dataframe with dates as rows, regions as columns.
Could I use mapply to do this more succinctly than running melt and dcast separately for each of {"employed","girls", "guys"}?
For example:
mapply(function(d,y) {melt(d[,c('region','date',y)], id.vars=c('region','date'))},
data,
c('employed','girls','guys')
)
tells me:
>Error in `[.default`(d, , c("region", "date", y)) :
incorrect number of dimensions
What I'm looking to get is a list of the wide-format dataframes; I figured mapply would be the easiest way to pass multiple arguments, but if there's a better way to go at this, I'm all for it.
Example:
$employed
date midwest northeast west
1 2010-02-01 62196 513366 119070
2 2010-02-02 334849 271383 160552
3 2010-02-03 187070 320594 119721
4 2010-02-04 146575 311999 310009
$girls
date midwest northeast west
1 2010-02-01 40 154 26
2 2010-02-02 88 76 61
3 2010-02-03 67 84 39
4 2010-02-04 48 95 42
$guys
date midwest northeast west
1 2010-02-01 16 140 43
2 2010-02-02 115 70 43
3 2010-02-03 63 64 42
4 2010-02-04 54 94 76

The old standby of split/lapply:
d<-melt(data,id.vars=c("region","date"))
lapply(split(d,d$variable),function(x) dcast(x,date~region,sum))
Example data has multiple matches, so I used an aggregating function of sum.
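As for the error in the question: mapply walks over the elements of each argument in parallel, and the elements of a data frame are its columns. Inside the function, d is therefore a single column vector, and d[, ...] fails with "incorrect number of dimensions". Looping over the variable names alone avoids this (with mapply, the whole data frame could instead be passed through MoreArgs). A minimal base-R sketch of that idea, using tapply in place of melt/dcast, so each result is a matrix with dates as row names rather than a data frame:

```r
# Rebuild the example data from the question
set.seed(1234)
data <- data.frame(
  region   = sample(c("northeast", "midwest", "west"), 40, replace = TRUE),
  date     = rep(seq(as.Date("2010-02-01"), length.out = 4, by = "1 day"), 10),
  employed = sample(50000:100000, 40, replace = TRUE),
  girls    = sample(1:40),
  guys     = sample(1:40)
)

vars <- c("employed", "girls", "guys")

# Loop over the variable NAMES only; the data frame is looked up inside the function
wide <- lapply(setNames(vars, vars), function(v) {
  # Sum each variable within every date/region cell (date x region matrix)
  tapply(data[[v]], list(date = data$date, region = data$region), sum)
})

str(wide, max.level = 1)
```

The setNames() call is only there so the resulting list is named after the variables, which matches the $employed, $girls, $guys layout asked for.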

Related

How do I plot from data frames?

From the following code, I got a data frame in R. I am trying to plot the data frame; however, I am only interested in the score they got on the Final. So I want the x-axis to be the student number (1 to 6, since that's how many data points there are), and I want the y-axis to be Final. Is there a way to do this from just the data frame?
data <- data.frame(Score1=c(100,36,58,77,99,92),Score2=c(56,68,68,98,15,35), Final=c(63,87,89,45,99,18))
Output listed below:
Score1 Score2 Final
1 100 56 63
2 36 68 87
3 58 68 89
4 77 98 45
5 99 15 99
6 92 35 18
Or will I have to do something like this instead? But this gives me an error that the lengths are not the same.
data <- data.frame(Score1=c(100,36,58,77,99,92),Score2=c(56,68,68,98,15,35))
Final=c(63,87,89,45,99,18)
f.data <- cbind(data,Final)
b <- 6
plot(b,Final)
Use the following
library(ggplot2)
qplot(x = 1:6, y = data$Final)
The code below can do the trick.
plot(data$Final)
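The length error in the question comes from plot(b, Final): b is a single number (6) while Final has six values, and plot() needs x and y of equal length. A small sketch of the same plot with an explicit x-axis, assuming the data frame from the question (named scores here only to avoid masking the built-in data() function):

```r
scores <- data.frame(
  Score1 = c(100, 36, 58, 77, 99, 92),
  Score2 = c(56, 68, 68, 98, 15, 35),
  Final  = c(63, 87, 89, 45, 99, 18)
)

# One x value per student: 1, 2, ..., 6
student <- seq_along(scores$Final)

# x and y now have the same length, so plot() accepts them
plot(student, scores$Final, xlab = "Student", ylab = "Final")
```

Calling plot(scores$Final) alone gives the same picture, because plot() uses the index as the x-axis when only one vector is supplied.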

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
Your exact problem is still a mystery to me, but it looks like you want a double for loop:
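# R is case-sensitive: the data frame is Thalweg, not thalweg
for (i in seq_len(nrow(Thalweg))) {
  # start from this site's Xdep_Vdep value
  residual <- Thalweg[i, "Xdep_Vdep"]
  for (j in 2:11) {
    # running minimum across the depth columns (columns 2 to 11)
    residual <- min(residual, Thalweg[i, j])
  }
}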
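One likely reason the attempts in the question mix up sites is that min() collapses everything it is given into a single number, whereas pmin() takes the minimum element by element and so keeps one value per row. The sketch below applies the Residual/Depth recurrence from the question to all sites at once and stores every intermediate column. It uses only three sites and three depth columns, and increment is a hypothetical placeholder for the unspecified other_variables / Stacks_Equation term (set to 0 here), so treat it as an outline rather than the actual calculation.

```r
# Cut-down copy of the sample data: three sites, three depth columns
Thalweg <- data.frame(
  StationID = c("1AAUA017.60", "1AXKR000.77", "2-BGU005.95"),
  AB0       = c(47, 30, 52),
  AB1       = c(45, 27, 67),
  AB2       = c(44, 24, 62),
  Xdep_Vdep = c(18.29, 6.46, 20.18)
)

depth_cols <- c("AB0", "AB1", "AB2")
increment  <- 0  # hypothetical stand-in for other_variables / Stacks_Equation

# One row per site, one column per residual
residuals <- matrix(
  NA_real_, nrow(Thalweg), length(depth_cols),
  dimnames = list(Thalweg$StationID, paste0("Residual_", depth_cols))
)

prev <- Thalweg$Xdep_Vdep
for (j in seq_along(depth_cols)) {
  # From the second column on, the previous residual is shifted by the increment
  if (j > 1) prev <- prev + increment
  # pmin() works row by row, so sites never get mixed together
  residuals[, j] <- pmin(prev, Thalweg[[depth_cols[j]]])
  prev <- residuals[, j]
}

# Depth_ABx = ABx - Residual_ABx, again one value per site
depths <- as.matrix(Thalweg[depth_cols]) - residuals
```

Both results can be attached back to the stations with cbind(Thalweg, residuals), since the row order is unchanged.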

R: How to divide a data frame by column values?

Suppose I have a data frame with 3 columns and 10 rows as follows.
# V1 V2 V3
# 10 24 92
# 13 73 100
# 25 91 120
# 32 62 95
# 15 43 110
# 28 54 84
# 30 56 71
# 20 82 80
# 23 19 30
# 12 64 89
I want to create sub-dataframes that divide the original by the values of V1.
For example,
the first data frame will have the rows with values of V1 from 10-14,
the second will have the rows with values of V1 from 15-19,
the third from 20-24, etc.
What would be the simplest way to make this?
So if this is your data
dd<-data.frame(
V1=c(10,13,25,32,15,28,30,20,23,12),
V2=c(24,73,91,62,43,54,56,82,19,64),
V3=c(92,100,120,95,110,84,71,80,30,89)
)
then the easiest way to split is using the split() command. And since you want to split in ranges, you can use the cut() command to create those ranges. A simple split can be done with
ss<-split(dd, cut(dd$V1, breaks=seq(10,35,by=5)-1)); ss
split returns a list where each item is the subsetted data.frame. So to get at the data.frame with the values for 10-14, use ss[[1]], and for 15-19, use ss[[2]] etc.
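A variation on the same idea, sketched under the assumption that V1 stays within 10 to 34: passing right = FALSE to cut() makes each interval include its lower bound, so the breaks can be written as 10, 15, 20, ... without subtracting 1, and labels gives the list elements readable names. The data below is the table from the question.

```r
dd <- data.frame(
  V1 = c(10, 13, 25, 32, 15, 28, 30, 20, 23, 12),
  V2 = c(24, 73, 91, 62, 43, 54, 56, 82, 19, 64),
  V3 = c(92, 100, 120, 95, 110, 84, 71, 80, 30, 89)
)

lower <- seq(10, 30, by = 5)  # lower bound of each range

# Intervals are [10,15), [15,20), ..., [30,35); values outside become NA
bins <- cut(
  dd$V1,
  breaks = c(lower, 35),
  right  = FALSE,
  labels = paste(lower, lower + 4, sep = "-")
)

ss <- split(dd, bins)
sapply(ss, nrow)  # number of rows that landed in each range
```

With named elements, the sub-data-frame for 10-14 is ss[["10-14"]] rather than ss[[1]]. Rows whose V1 falls outside the breaks get an NA bin and are dropped by split(), so the breaks need to cover the full range of the data.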

Count rows for selected column values and remove rows based on count in R

I am new to R and am trying to work on a data frame from a csv file (as seen from the code below). It has hospital data with 46 columns and 4706 rows (one of those columns being 'State'). I made a table showing counts of rows for each value in the State column. So in essence the table shows each state and the number of hospitals in that state. Now what I want to do is subset the data frame and create a new one without the entries for which the state has fewer than 20 hospitals.
How do I count the occurrences of values in the State column and then remove those that count up to fewer than 20? Maybe I am supposed to use the table() function, remove the undesired data and put that into a new data frame using something like lapply(), but I'm not sure due to my lack of experience in programming with R.
Any help will be much appreciated. I have seen other examples of removing rows that have certain column values in this site, but not one that does that based on the count of a particular column value.
> outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
> hospital_nos <- table(outcome$State)
> hospital_nos
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37 134
MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA
133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370 42 87
VI VT WA WI WV WY
2 15 88 125 54 29
Here is one way to do it. Starting with the following data frame:
df <- data.frame(x=c(1:10), y=c("a","a","a","b","b","b","c","d","d","e"))
If you want to keep only the rows with more than 2 occurrences in df$y, you can do:
tab <- table(df$y)
df[df$y %in% names(tab)[tab>2],]
Which gives:
x y
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
And here is a one-line solution with the plyr package:
library(plyr)
ddply(df, "y", function(d) {if(nrow(d)>2) d else NULL})
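Applied to the hospital question, the table() approach might look like the sketch below. The outcome data frame here is a made-up toy stand-in for the CSV (the Hospital column and its values are hypothetical), and min_count is set to 3 so the toy example has something to drop; for the real data it would be 20.

```r
# Toy stand-in for read.csv("outcome-of-care-measures.csv", colClasses = "character")
outcome <- data.frame(
  Hospital = paste0("H", 1:7),
  State    = c("TX", "TX", "TX", "GU", "CA", "CA", "TX"),
  stringsAsFactors = FALSE
)

min_count <- 3  # use 20 for the real data set

# Count hospitals per state, then keep the states that reach the threshold
counts      <- table(outcome$State)
keep_states <- names(counts)[counts >= min_count]

# Subset the original rows to those states
filtered <- outcome[outcome$State %in% keep_states, ]
filtered
```

Note the >= here: "fewer than 20" is removed, so a state with exactly 20 hospitals is kept.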

How can I get column data to be added based on a group designation using R?

The data set that I'm working with is similar to the one below, although the example is on a much smaller scale (the data I'm working with has tens of thousands of rows). I haven't been able to figure out how to get R to add up column data based on the group number. Essentially I want to get the number of green(s), blue(s), and red(s) added up for all of group 81 and group 66 separately, and then use that information to calculate percentages.
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), sep = " ", header = TRUE)
I've spent a good deal of time trying to figure out how to use some of the functions on my own hoping I would stumble across a proper way to do it, but since I'm such a new basic user I feel like I have hit a wall that I cannot progress past without help.
One way is via aggregate. Assuming your data is in an object x:
aggregate(. ~ Group, data=x, FUN=sum)
# Group Green Blue Red Total
# 1 66 46 48 13 107
# 2 81 71 30 33 134
Both of the answers above are perfect examples of how to address this type of problem. Two other options exist within reshape and plyr:
library(reshape)
cast(melt(dat, "Group"), Group ~ ..., sum)
library(plyr)
ddply(dat, "Group", function(x) colSums(x[, -1]))
I would suggest that @Joshua's answer is neater, but two functions you should learn are apply and tapply. If a is your data set, then:
## apply calculates the sum of each row
> total = apply(a[,2:4], 1, sum)
## tapply calculates the sum based on each group
> tapply(total, a$Group, sum)
66 81
107 134
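The question also asks for percentages. One way to get there, sketched on the example data, is to divide the per-group sums by the per-group Total; the names sums and pct are just illustrative.

```r
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(text = txt, header = TRUE)

# Per-group column sums (one row per Group)
sums <- aggregate(. ~ Group, data = dat, FUN = sum)

# Share of each colour within its group, in percent
colours <- c("Green", "Blue", "Red")
pct <- cbind(Group = sums$Group, 100 * sums[, colours] / sums$Total)

round(pct, 1)
```

Dividing a data frame by a vector with one entry per row applies the division row by row, so each group's colours are divided by that group's own Total.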
