I have 5 categorical variables: age (5 levels), sex (2 levels), zone (4 levels), qmat (5 levels), and qsoc (5 levels), for a total of 1000 unique combinations. Each unique combination has a corresponding data value (e.g. population size). I would like to assign this data to a 1000 x 6 table where the first five columns contain the indices of age, sex, zone, qmat, and qsoc, and the 6th column holds the data value.
I would like to avoid nested for loops, which are inefficient in R (some of my datasets will have more than 1000 unique combinations). I know there exist many tools in R for parallel operations, but I am not familiar with them. Is there an efficient way to perform the above variable assignment using parallel/vector operations? Any suggestions or references would be appreciated.
It's hard to tell what your original data looks like, but assuming you have it in a data frame, you may want to use aggregate().
# simulate a data frame
set.seed(1)
N <- 9000
df <- data.frame(pop = rnorm(N),
                 age = sample(1:5, N, replace = TRUE),
                 sex = sample(1:2, N, replace = TRUE))
# aggregate this data frame by 'age' and 'sex'
newData <- aggregate(pop ~ age + sex, data = df, FUN = sum)
The R function expand.grid() solves my problem, e.g.
expand.grid(list(age, sex, zone, qmat, qsoc))
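For completeness, a minimal sketch of the full assignment (the level counts follow the question; the pop values are simulated here for illustration):
# build all 5 * 2 * 4 * 5 * 5 = 1000 index combinations, one row per combination
tbl <- expand.grid(age = 1:5, sex = 1:2, zone = 1:4, qmat = 1:5, qsoc = 1:5)
# attach the data value as the 6th column (simulated here)
tbl$pop <- rnorm(nrow(tbl))
head(tbl)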
Thanks for all the responses and I apologize for any possible vagueness in the wording of my question.
I have been trying to create a contingency table in R with the percentage distribution of education for husbands (6 categories) and wives (6 categories) BY marriage cohort (4 cohorts in total). My ideal is something like this: IdealTable.
However, the closest I have been able to get is: CurrentTable.
I am not able to figure out how to convert my row and column sums to percentages (similar to the ideal). The current code that I am using is:
three.table = addmargins(xtabs(~MarriageCohort + HerEdu + HisEdu, data = mydata))
ftable(three.table)
Is there a way I can turn the row and column sums into percentages for each marriage cohort?
How can I add labels to this and export the ftable?
I am relatively new to R and have tried to find solutions to the questions above on Google, but haven't been successful. This is my first post on this platform, and any help will be greatly appreciated! Thank you!
One approach would be to create separate xtab runs for each MarriageCohort:
Cohorts <- lapply( split(mydata, mydata$MarriageCohort),
                   function(z) xtabs( ~HerEdu + HisEdu, data = z) )
Then get the total of each Cohorts item, divide each cohort's addmargins(.) result by that total, and multiply by 100 to get percent values:
divCohorts <- lapply(Cohorts, function(tbl) 100*addmargins(tbl)/sum(tbl) )
Then you will need to clean those items up to your liking. You have not included data, so the cleanup remains your responsibility. (I did not use sapply because that could give you a big matrix that might be difficult to manage, but you could try it and see if you are satisfied with that approach in the second step.)
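A minimal reproducible sketch with simulated data (the variable names follow the question; the values are made up):
set.seed(1)
mydata <- data.frame(MarriageCohort = sample(1:4, 500, replace = TRUE),
                     HerEdu = sample(1:6, 500, replace = TRUE),
                     HisEdu = sample(1:6, 500, replace = TRUE))
Cohorts <- lapply(split(mydata, mydata$MarriageCohort),
                  function(z) xtabs(~ HerEdu + HisEdu, data = z))
divCohorts <- lapply(Cohorts, function(tbl) 100 * addmargins(tbl) / sum(tbl))
round(divCohorts[["1"]], 1)  # percentage table for the first cohort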
I have a dataset containing rows of unique identifiers. Each unique identifier occupies several rows because each person (identifier) has several ratings. For example, unique identifier 1 may have a rating for Goal A, Goal B, and Goal C, each represented in a separate row.
What would be the best way to find the average for each unique identifier (i.e. for manager 1 (unique identifier 1), what is their average score across Goal A, Goal B, and Goal C)?
In Excel, I'd do this by sorting the data, extracting the unique identifiers, copying and pasting those values at the bottom of the dataset, and finding the average using a series of conditional statements. I'm sure there must be a way to do this in R. I would appreciate any help/insight.
I started with this code, but am not sure if this is what I need. I'm filtering by departments (FSO), then asking it to give me a list of unique IDs, and then computing the average for each manager.
df %>% filter(newdept=='FSO') %>%
distinct(ID) %>%
summarize(compmean = mean(CompRating2, na.rm=TRUE))
A base R solution would be to use aggregate:
dat <- data.frame(id=sample(LETTERS, 50, replace=TRUE), score=sample(1:5, 50, replace=TRUE), stringsAsFactors=FALSE)
aggregate(score ~ id, data=dat, mean)
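Since you are already using dplyr, a grouped summary is the equivalent there (a sketch reusing the simulated dat from above):
library(dplyr)
dat %>%
  group_by(id) %>%
  summarize(compmean = mean(score, na.rm = TRUE))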
I have a data set in a wide format consisting of two rows: one with the variable names and one with the corresponding values. The variables represent characteristics of individuals from a sample of size 1000. For instance, I have 1000 variables for the size of each individual, then 1000 variables for the height, then 1000 variables for the weight, etc. Now I would like to run simple regressions (say, weight on calorie consumption). The only way I can think of doing this is to declare a vector that contains the 1000 observations of each variable, for instance:
regressor1=c(mydata$height0, mydata$height1, mydata$height2, mydata$height3, ... mydata$height1000)
But given that I have a few dozen variables, each containing 1000 observations, this will become cumbersome. Is there a way to do this with a loop?
I have also thought about the reshape options in R, but this again would put me in a position where I have to type 1000 variable names a few dozen times.
Thank you for your help.
Here is how I would go about your issue. t() will transpose the data for you from many columns to many rows.
Note: t() is designed for a matrix rather than a data frame; I simply coerced to a data frame to show that my example will work with your data.
# many columns, 2 rows
x <- as.data.frame(matrix(1:2000, nrow = 2, ncol = 1000))
# 2 columns, many rows
xt <- t(x)
Based on your comments you are looking to generate vectors.
If you have transposed:
regressor1 <- xt[, 1]
regressor2 <- xt[, 2]
If you have not transposed:
regressor1 <- as.numeric(x[1, ])  # a data-frame row must be flattened to a plain vector
regressor2 <- as.numeric(x[2, ])
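From there you can run the regression directly on the transposed object. A sketch, where the column labels are hypothetical stand-ins for your real variable names:
df_long <- as.data.frame(xt)               # one row per individual, one column per variable
names(df_long) <- c("weight", "calories")  # hypothetical labels for illustration
fit <- lm(weight ~ calories, data = df_long)
summary(fit)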
I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4 or 5 groups based on Price, minimising the variance within each group while preserving the order of Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function; however, it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more data):
Price=c(90,100,125,100,130,182,125,250,300,95)
Size=c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata=data.frame(Size,Price)
I would like to group the data so as to minimize the variance of Price in each group while respecting 1) the Size value: for example, the first two prices, 90 and 100, cannot be in different groups since they have the same Size; and 2) the order of Size: for example, if Group One includes observations (Obs) 1-2 and Group Two includes observations 3-9, observation 10 can only enter into Group Two or Three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% can be read as "then do", so you can combine multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <- c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size,Price) %>% # "then"
group_by(Size) # group data by Size column
mydata_mean_sd <- mydata %>% # "then"
    summarise(mean = mean(Price), sd = sd(Price)) # calculate grouped mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
Not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points, then calculates the sum of squared deviations from the class means, sum(SDCM), for those break points (following the algorithm underlying the Jenks approach described on Wikipedia).
Then I just iterated through all valid combinations of break points and selected the set of points that produced the minimum sum(SDCM).
This would quickly become unmanageable as the number of possible break-point combinations increases, but it worked for my data set.
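Here is a minimal sketch of that brute-force idea applied to the example data above, assuming breaks may only fall where Size changes (so equal sizes always stay together) and k is the chosen number of breaks:
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size  <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
candidate_breaks <- which(diff(Size) != 0)  # splits are only allowed where Size changes
sdcm <- function(x) sum((x - mean(x))^2)    # squared deviations from the class mean
k <- 3                                      # 3 breaks -> 4 groups
best <- NULL
for (combo in combn(candidate_breaks, k, simplify = FALSE)) {
  groups <- cut(seq_along(Price), breaks = c(0, combo, length(Price)))
  total <- sum(tapply(Price, groups, sdcm))
  if (is.null(best) || total < best$total) best <- list(total = total, breaks = combo)
}
best  # break positions minimising the total within-group SDCM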
I am a new R user and an inexperienced coder, and I have a data-handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50,000 rows. I want to generate and store, for every firm, a (class x year) matrix with class counts as the elements of the matrix. Each matrix would automatically be named something like firm.name and stored so that I can use it afterwards for computations. Ideally, I'd also be able to change the simple class counts into a function of the values in columns 4 and 5 (backward and forward citations).
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class, year, firm), as these columns have the same length. However, I don't know how to store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
# note the assignment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
# access the data for the first chick
mytable[,,1]
# turn the table object into a data.frame
as.data.frame(mytable)
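If you want one matrix per firm stored under the firm's name, a named list is a convenient container. A sketch, assuming your data frame is called df with the columns firm, year, and class described in the question ("firm.name" stands in for an actual firm name):
counts_by_firm <- lapply(split(df, df$firm),
                         function(d) table(d$class, d$year))
counts_by_firm[["firm.name"]]  # one firm's class-by-year count matrix, accessed by name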