I'm having trouble generating a table with plyr and I hope you can help. If you run the code below you should get a table with proportions at the highest aggregate level of my data (i.e., the whole data set). However, I'd like to get the same table with proportions per item for each school. Thanks for any help. Also, if there's a better way to synthesize this with just dplyr I'm open to it. I'm trying to integrate some of these new packages into my workflow.
# load packages
library(plyr)
library(dplyr)
library(reshape2)
library(tidyr)
library(xtable)
# generate fake Data
set.seed(500)
School <- rep(seq(1:20), 2)
District <- rep(c(rep("East", 10), rep("West", 10)), 2)
Score <- rnorm(40, 100, 15)
Student.ID <- sample(1:1000,8,replace=T)
items <- data.frame(replicate(10, sample(1:4, 40, replace=TRUE)))
items <- data.frame(lapply(items, factor, ordered=TRUE,
levels=1:4,
labels=c("Strongly disagree","Disagree",
"Agree","Strongly Agree")))
school.data <- data.frame(Student.ID, School, District, Score, items)
rm(items)
# code for table
items <- select(school.data, School, X1:X10)
g <- items %>%
gather(Item, response, -School)
# This gives me the aggregate results for the entire data set
foo <- ddply(g, .(Item), function(x) prop.table(table(x$response))) #I stupidly tried .(Item, School) to no avail
xtable(foo)
Try
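# margin = c(2, 3) normalizes within each Item-by-School cell, so the response proportions sum to 1 per item per school
prop.table(with(g, table(response, Item, School)), margin = c(2, 3))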
This gives a 4x10x20 array (responses, items, schools). You can use as.data.frame on the result for conversion if needed.
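Since the question also asks for a dplyr-only route, here is a minimal sketch of per-school, per-item proportions. It assumes long-format data with columns School, Item and response, like g in the question; the small g_demo data frame and the output column prop are made up for illustration (n is the column that count() creates).

```r
library(dplyr)

set.seed(1)
# Hypothetical stand-in for the long-format `g` from the question
g_demo <- data.frame(
  School   = rep(1:2, each = 20),
  Item     = rep(c("X1", "X2"), times = 20),
  response = sample(c("Disagree", "Agree"), 40, replace = TRUE)
)

props <- g_demo %>%
  count(School, Item, response) %>%   # one row per observed School/Item/response
  group_by(School, Item) %>%
  mutate(prop = n / sum(n)) %>%       # proportions within each School-Item pair
  ungroup()
```

Response categories that never occur for a given school and item simply have no row here; tidyr::complete() could fill them with zeros if a full table is needed.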
I have a small problem with R.
I have merged together 2 datasets and I have to compute simple ratios between them. The datasets are not that small (18 columns per dataset) and I would like to avoid going by simple brute force.
To give you an example
df <- data.frame(a1= sample(1:100, 10), b1 = sample(1:100, 10), a2= sample(1:100, 10), b2 = sample(1:100,10))
The ratios would simply be a column divided by another one, so in the example it would be c1=a1/b1 and c2=a2/b2. And it could be simply implemented by:
mutate(df, c1=a1/b1, c2=a2/b2)
My question is if there is a way to make this process automatic and instruct R to perform a mutate without manually inputting all the formulas such that it computes c1,c2,c3.... c18.
I've tried setting up a for loop over subsets of the columns, but I can't seem to make it work within the tidyverse.
Thank you in advance
One simple base R way would be to do something like:
for (i in 1:2) {
df[paste0("c", i)] <- df[paste0("a", i)] / df[paste0("b", i)]
}
But it's dependent on what pattern your actual variable names have.
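If the columns really do follow the a1/b1, a2/b2 pattern, the same idea can be written without an explicit loop, because dividing two data frames of equal shape works element by element. A small sketch; df_demo and the index vector idx are illustrative names, not part of the question:

```r
set.seed(1)
# Hypothetical data with the same column pattern as in the question
df_demo <- data.frame(a1 = sample(1:100, 10), b1 = sample(1:100, 10),
                      a2 = sample(1:100, 10), b2 = sample(1:100, 10))

idx <- 1:2  # would presumably be 1:18 for the real data
# Divide all "a" columns by the matching "b" columns and store them as "c" columns
df_demo[paste0("c", idx)] <- df_demo[paste0("a", idx)] / df_demo[paste0("b", idx)]
```

This relies on the "a" and "b" columns being selected in the same order, so it is only as reliable as the naming pattern.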
Another way using tidyverse tools (but there's probably a more elegant way of doing this):
library(tidyverse)
library(glue)
map_dfc(1:2, function(x) {
transmute(df, "c{x}" := .data[[glue("a{x}")]] / .data[[glue("b{x}")]])
}) %>%
bind_cols(df) %>%
relocate(starts_with("c"), .after = last_col())
Thanks to all in advance.
I have the following data:
set.seed(123)
data <- data.frame (name=LETTERS[sample(1:26, 500, replace=T)],present=sample(0:1,500,replace = T))
And I want to quickly calculate the percentage of present observations (1's) for each letter. I can do it manually, but I believe there is an easier way to do this:
library(dplyr)
A <- filter(data, name=="A" & present==1)
A2 <- filter(data, name=="A")
data$Percentage[data$name=="A"] <- nrow(A)/nrow(A2)
And so on until I get to "Z".
Can I automate this task without having to change the values of the "name" column manually?
Best regards,
We can use prop.table with table to get the proportion
prop.table(table(data), 1)[,2]
To add it as a column, we can expand it by matching with the 'names'
data$Percentage <- prop.table(table(data), 1)[,2][as.character(data$name)]
Or as @Lars Lau Raket suggested, we don't need to convert to character
prop.table(table(data), 1)[,2][data$name]
If we need to create a column
library(dplyr)
data %>%
group_by(name) %>%
mutate(Percentage = mean(present==1))
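For completeness, base R's ave() can do the same grouped calculation without extra packages, since its default summary function is mean. This is only a sketch on data generated like the question's; the name d is chosen here to avoid overwriting anything.

```r
set.seed(123)
d <- data.frame(name    = LETTERS[sample(1:26, 500, replace = TRUE)],
                present = sample(0:1, 500, replace = TRUE))

# ave() returns one value per row: the group mean of the 0/1 indicator,
# i.e. the share of present == 1 within each letter
d$Percentage <- ave(as.numeric(d$present == 1), d$name)
```

The result has one value per original row, so it can be stored directly as a column.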
I have a data.frame with a set of records, with different measurements as variables. I would like to create a new data.frame containing the number of records having each specific value, for each measurement. Basically, what I am trying to do is:
record <- c("r1", "r2", "r3")
firstMeasurement <- c(15, 10, 10)
secondMeasurement <- c(2, 4, 2)
df <- data.frame(record, firstMeasurement, secondMeasurement)
measurements <- c(colnames(df[2:3]))
measuramentsAggregate <- lapply(measurements, function(i)
aggregate(record~i, df, FUN=length))
I am getting really funny errors and I don't understand why. Can anyone help me?
Many thanks!
I think this is what you want
library(dplyr)
agg.measurements <- df %>% group_by(firstMeasurement) %>% summarise(records=n())
That should do it for the first measurement.
If you want the number of records with specific firstMeasurement:
table(df$firstMeasurement)
Likewise for the secondMeasurement. I am not sure how the data.frame you are trying to create might look.
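A likely cause of the errors in the original lapply call is that record ~ i is taken literally: aggregate looks for a column named i rather than the column whose name is stored in i. One way around that, sketched below, is to build the formula from the string with reformulate(). The names measure and counts are made up for this example.

```r
record <- c("r1", "r2", "r3")
firstMeasurement <- c(15, 10, 10)
secondMeasurement <- c(2, 4, 2)
df <- data.frame(record, firstMeasurement, secondMeasurement)

measurements <- names(df)[2:3]

counts <- lapply(measurements, function(measure) {
  # Builds e.g. record ~ firstMeasurement from the column name held in `measure`
  f <- reformulate(measure, response = "record")
  aggregate(f, data = df, FUN = length)
})
names(counts) <- measurements
```

Each list element is a small data frame with the measurement value in one column and the number of records in the other.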
I have a very basic R question but I am having a hard time trying to get the right answer. I have a data frame that looks like this:
species <- "ABC"
ind <- rep(1:4, each = 24)
hour <- rep(seq(0, 23, by = 1), 4)
depth <- runif(length(ind), 1, 50)
df <- data.frame(cbind(species, ind, hour, depth))
df$depth <- as.numeric(df$depth)
What I would like is to select AND replace with zero all the depth values where depth < 10 (for example), while keeping all the information associated with those rows and the original dimensions of the data frame.
I have tried the following, but it does not work.
df[df$depth<10] <- 0
Any suggestions?
# reassign depth values under 10 to zero
df$depth[df$depth<10] <- 0
For columns that are factors, you can only assign values that are existing factor levels. To assign a value that isn't currently a level, add the level first:
levels(df$species) <- c(levels(df$species), "unknown")
df$species[df$depth<10] <- "unknown"
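As for why the attempt in the question fails: a single logical vector inside df[...] is treated as a column selector, not a row selector. Giving both a row condition and a column name makes the intent explicit. The sketch below uses a made-up data frame df_demo built directly with numeric columns; it is an illustration, not the asker's data.

```r
set.seed(42)
df_demo <- data.frame(species = "ABC",
                      ind     = rep(1:4, each = 24),
                      hour    = rep(0:23, 4),
                      depth   = runif(96, 1, 50))

dims_before <- dim(df_demo)

# Rows where depth < 10, column "depth" only; all other columns are untouched
df_demo[df_demo$depth < 10, "depth"] <- 0

# The data frame keeps its original dimensions
stopifnot(identical(dim(df_demo), dims_before))
```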
I arrived here from a Google search. Since my other code is 'tidy', I'm leaving the 'tidy' way here for anyone else who may find it useful:
library(dplyr)
iris %>%
mutate(Species = ifelse(as.character(Species) == "virginica", "newValue", as.character(Species)))
I have a data frame with annual exports of firms to different countries in different years. My problem is that I need to create a variable that says, for each year, how many firms there are in each country. I can do this perfectly with a "tapply" command, like
incumbents <- tapply(id, destination-year, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has length length(destination-year), and I need it to have length length(id) (there are many firms each year serving each destination) so that I can use it in a subsequent regression, matched by year and destination. A "for" loop can do this, but it is very time-consuming since the database is huge.
Any suggestions?
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, `destination-year`, FUN=function(x) length(unique(x)))
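A quick check of the ave idea on made-up data (the data frame dat_demo and its values are invented here, not the asker's). Passing destination and year as separate grouping variables avoids needing a combined destination-year variable at all:

```r
set.seed(1)
n <- 1000
dat_demo <- data.frame(id          = sample(1:10, n, replace = TRUE),
                       year        = sample(2000:2011, n, replace = TRUE),
                       destination = sample(LETTERS[1:6], n, replace = TRUE))

# One value per row: number of distinct firms serving that destination in that year
dat_demo$incumbents <- ave(dat_demo$id, dat_demo$destination, dat_demo$year,
                           FUN = function(x) length(unique(x)))
```

Because ave returns a vector as long as its input, the result lines up with id row by row and can go straight into a regression data frame.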
Just "merge" the tapply summary back in with the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=T)
year = sample(2000:2011, n, replace=T)
destination = sample(LETTERS[1:6], n, replace=T)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
By the way, instead of combining destination and year into a third variable, like it seems you've done, tapply can handle both variables directly as a list:
library(reshape2)  # provides melt
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
Using @JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
library(plyr)  # provides ddply
dat <- ddply(dat,.(destinationYear),transform,newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
datTable <- datTable[,transform(.SD,newCol = length(unique(id))),by = destinationYear]
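With data.table, a commonly used alternative to transform(.SD, ...) is to add the column by reference with :=. The sketch below assumes the data.table package is installed and uses invented data (dt_demo); uniqueN() is data.table's helper for counting distinct values.

```r
library(data.table)

set.seed(1)
n <- 1000
dt_demo <- data.table(id          = sample(1:10, n, replace = TRUE),
                      year        = sample(2000:2011, n, replace = TRUE),
                      destination = sample(LETTERS[1:6], n, replace = TRUE))

# Adds newCol in place: distinct firm count per destination-year group,
# repeated on every row of the group
dt_demo[, newCol := uniqueN(id), by = .(destination, year)]
```

This keeps the original row order and avoids copying the table, which may matter for a large data set.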