Processing survey multiple-choice data in R

I need to analyse survey data to get the frequencies of a multiple-choice question variable. I'm using the questionr R package.
I understand that I need to use the multi.split function to create the variable I will be working with, but I need to know how to make it reference answers that are not in the data set, meaning answers that were part of the original question but were never selected during the survey and should therefore be displayed with the value 0.
Example:
I have the following possible answers:
"red", "blue", "green" and "yellow"
and my data is (like in the example):
v <- c("red/blue","green","red/green","blue/red")
when I run this command:
multi.table(multi.split(v))
I get the following result:
         n %multi
v.blue   2     50
v.red    3     75
v.green  2     50
but I would like to get:
         n %multi
v.blue   2     50
v.red    3     75
v.green  2     50
v.yellow 0      0
Any ideas on how I can do that?

I have never used this package before, but I'll give it a try.
The function multi.split() produces a data.frame, so if you want to add another column before getting the statistics you could do something like the following:
v <- c("red/blue","green","red/green","blue/red")
a <- multi.split(v)
a$v.yellow <- 0
multi.table(a)
## > multi.table(a)
##          n %multi
## v.blue   2     50
## v.red    3     75
## v.green  2     50
## v.yellow 0      0
Update
A more generic version would go something like this:
1. wanted_data is a character vector of the column names that you always want in your output.
2. col.to.add are the columns from that list that were not in the a data.frame.
3. Assign 0 to the columns that were not present.
4. Finally, order the columns so we always get them in the same order.
library(questionr)
v <- c("red/blue","green","red/green","blue/red")
wanted_data <- c("v.red","v.blue","v.green","v.yellow")
a <- multi.split(v)
col.to.add <- wanted_data[!(wanted_data %in% colnames(a))]
a[col.to.add] <- 0
a <- a[, order(colnames(a))]
multi.table(a)
## > multi.table(a)
##          n %multi
## v.blue   2     50
## v.green  2     50
## v.red    3     75
## v.yellow 0      0
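If you need to do this for several variables, the same steps can be wrapped in a small helper. This is only a sketch, and add_missing_cols is a name I made up rather than anything provided by questionr:
library(questionr)
# Sketch: add any wanted columns that multi.split() did not create
# (because that answer never occurred), fill them with 0, and return
# the columns in a fixed order.
add_missing_cols <- function(splits, wanted) {
  missing_cols <- wanted[!(wanted %in% colnames(splits))]
  splits[missing_cols] <- 0
  splits[, wanted, drop = FALSE]
}
v <- c("red/blue", "green", "red/green", "blue/red")
a <- add_missing_cols(multi.split(v), c("v.red", "v.blue", "v.green", "v.yellow"))
multi.table(a)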

R enumerate duplicates in a dataframe with unique value

I have a dataframe containing a set of parts and test results. The parts are tested at 3 sites (North, Centre and South). Sometimes those parts are re-tested. I eventually want to create some charts that compare the results from the first time a part was tested with the second (or third, etc.) time it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the code below. I've explicitly removed the "Experiment" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works, but it seems that there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)
data<-data.frame(part,site,result)
data$index<-1
repeat {
  if (!anyDuplicated(data[, c("part", "site", "index")])) break
  data$index <- ifelse(duplicated(data[, 1:2]), data$index + 1, data$index)
}
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
#Generate a trial data frame from the morley dataset
df<-morley[,c(2,3)]
#Set up an iterative variable
#Create the index column and initialise to 1
df$index<-1
# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
  if (!anyDuplicated(df[, c(1, 3)])) break
  df$index <- ifelse(duplicated(df[, c(1, 3)]), df$index + 1, df$index)
}
# Check - The below vector should all be true
df$index==morley$Expt
We may use diff and cumsum on the 'Run' column to get the expected output. With this method we do not need to create a column of 1s (the 'index' column), and we assume that the sequence in 'Run' is ordered as shown in the OP's example.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
Or we can use getanID from library(splitstackshape)
library(splitstackshape)
getanID(data, c('part', 'site'))[]
I think this is a job for make.unique, with some manipulation.
# make.unique() appends ".1", ".2", ... to repeated values; stripping the
# leading digits leaves that counter (NA for first occurrences, hence the ifelse)
index <- 1L + as.integer(sub("\\d+(\\.)?", "", make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index), 1L, index)
identical(index,morley$Expt)
[1] TRUE
Details of your actual data.frame may matter. However, a couple of options working with your example:
#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]

Use value in new variable name

I am trying to build a for loop which will step through each site, for that site calculate frequencies of a response, and put those results in a new data frame. Then after the loop I want to be able to combine all of the site data frames so it will look something like:
Site Genus Freq
1 A 50
1 B 30
1 C 20
2 A 70
2 B 10
2 C 20
But to do this I need my names (of vectors, data frames) to change each time through the loop. I think I can do this using the SiteNum variable, but how do I insert it into new variable names? The way I tried (below) treats SiteNum as part of the string and doesn't insert its value into the name.
I feel like what I want to use is a placeholder %, but I don't know how to do that with variable names.
SiteNum <- 1
for (Site in CoralSites){
  Csub_SiteNum <- subset(dfrmC, Site==CoralSites[SiteNum])
  CGrfreq_SiteNum <- numeric(length(CoralGenera))
  for (Genus in CoralGenera){
    CGrfreq_SiteNum[GenusNum] <- mean(dfrmC$Genus == CoralGenera[GenusNum])*100
    GenusNum <- GenusNum + 1
  }
  names(CGrfreq_SiteNum) <- c(CoralGenera)
  Site_SiteNum <- c(Site)
  CG_SiteNum <- data.frame(CoralGenera,CGrfreq_SiteNum,Site_SiteNum)
  SiteNum <- SiteNum + 1
}
Your question as stated asks how you can create a bunch of variables, e.g. CGrfreq_1, CGrfreq_2, ..., where the name of the variable indicates the site number that it corresponds to (1, 2, ...). While you can do such a thing with functions like assign, it is not good practice for a few reasons:
It makes the code that generates the variables more complicated, because it will be littered with calls to assign, get and paste0 (see the sketch after this list).
It makes your data more difficult to manipulate afterwards -- you'll need to (either manually or programmatically) identify all the variables of a certain type, grab their values with get or mget, and then do something with them.
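Purely for illustration, the discouraged pattern would look roughly like the sketch below. It reuses the names from your loop (CoralSites, dfrmC), and the objects it creates (Csub_1, Csub_2, ...) are made up, so treat it as a sketch of what to avoid rather than working code for your data:
# Sketch of the discouraged approach: one object per site via paste0()/assign()
for (SiteNum in seq_along(CoralSites)) {
  Csub <- subset(dfrmC, Site == CoralSites[SiteNum])
  assign(paste0("Csub_", SiteNum), Csub)   # creates Csub_1, Csub_2, ...
}
# Getting them back later needs get()/mget():
# all_sites <- mget(paste0("Csub_", seq_along(CoralSites)))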
Instead, you'll find it easier to work with other R functions that will perform the aggregation for you. In this case you're looking to generate for each Site/Genus pairing the percentage of data points at the site with the particular genus value. This can be done in a few lines of code with the aggregate function:
# Sample data:
(dat <- data.frame(Site=c(rep(1, 5), rep(2, 5)), Genus=c(rep("A", 3), rep("B", 6), "A")))
# Site Genus
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 1 B
# 6 2 B
# 7 2 B
# 8 2 B
# 9 2 B
# 10 2 A
# Calculate frequencies
dat$Freq <- 1
res <- aggregate(Freq~Genus+Site, data=dat, sum)
res$Freq <- 100 * res$Freq / table(dat$Site)[as.character(res$Site)]
res
# Genus Site Freq
# 1 A 1 60
# 2 B 1 40
# 3 A 2 20
# 4 B 2 80
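Not part of the original answer, but as a cross-check the same percentages can be read off a two-way table with prop.table; for the sample data above this should give roughly:
# Row-wise percentages of Genus within each Site
100 * prop.table(table(dat$Site, dat$Genus), margin = 1)
#      A  B
#    1 60 40
#    2 20 80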

Setting up a counter in an R simulation

I would like to do a simulation in R. I would like to set up a loop with some large number of trials. Specifically, I would like to use a normal distribution with known mean, standard deviation and N = 9. I would like to set up a counter which counts the number of times one of the replicates goes below (or above) a certain value. I would also like to see a histogram of the data generated.
Not a big fan of loops, so I'd do something like this:
func <- function(n){
  counter <- 0
  x <- rnorm(1, 0, 1)
  if (x > 2 | x < (-2)) counter <- 1
  return(c(n, x, counter))
}
n=1:1000
sum(do.call(rbind,lapply(n,func))[,3])
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 41
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 43
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 43
> sum(do.call(rbind,lapply(n,func))[,3])
[1] 39
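As a sanity check on those counts: for a standard normal, P(|X| > 2) is about 4.55%, so roughly 45 hits per 1000 draws are expected, which is consistent with the 39-43 seen above. You can verify the probability directly:
2 * pnorm(-2)
# [1] 0.04550026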
The do.call(rbind, lapply(n, func)) call also gives you the actual data you need to make the histogram of the numbers generated:
dat<-data.frame(do.call(rbind,lapply(n,func)))
names(dat)<-c("n","x","counter")
head(dat)
n x counter
1 1 -0.6591145 0
2 2 1.8163984 0
3 3 -2.0291848 1
4 4 0.3309398 0
5 5 -0.8214298 0
6 6 0.5275238 0
Try something along these lines.
#in this structure each row in the matrix is a sim rep
sim.data<- matrix(rnorm(9*1000,0,1),1000,9)
#this counts number of observations below threshold for each rep
below <- apply(sim.data, 1, function(x) sum(x<0.60))
hist(below)
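If the counter you want is the number of replicates that go below the threshold at least once, and you also want a histogram of the raw simulated values themselves (rather than of the per-replicate counts), a couple of extra lines on top of the above should do it:
# number of replicates with at least one observation below 0.60
sum(below > 0)
# histogram of all the generated values
hist(sim.data)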

Subset based on granularity and average values

I have a large data frame consisting of two columns. I want to calculate the average of the second column's values for each subset of the first column. The subsets of the first column are based on a specified granularity. For example, for the following data frame, df, I want to calculate the average of the df$B values for each subset of df$A, with an increment (granularity) of 1 for each subset. The results should go in two new columns.
A        B    expected results:  newA  newB
0.22096  1                       0     1.142857
0.33489  1                       1     2
0.33655  1                       2     4
0.43953  1
0.64933  2
0.86668  1
0.96932  1
1.09342  2
1.58314  2
1.88481  2
2.07654  4
2.34652  3
2.79777  5
This is a simple example; I'm not sure how to loop over the whole data frame and perform the calculation, i.e. the average of df$B.
I tried the code below to build the subsets, but couldn't figure out how to append the results and create the final result:
increment <- 1
mx <- max(df$A)
i <- 0
newDF <- data.frame()
while (i < mx){
  tmp <- subset(df, A > i & A < (i + increment))
  i <- i + increment
}
I'm not sure about the logic, but I'm sure there is a short way to do the required calculation. Any thoughts?
I would use findInterval for the subset selection (in your example a simple ceiling of each A value would be sufficient too, but if your increment is different from 1 you need findInterval) and tapply to calculate the mean:
df <- read.table(textConnection("
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5"), header=TRUE)
## sort data.frame by column A (needed for findInterval)
df <- df[order(df$A), ]
## define granularity
subsets <- seq(1, max(ceiling(df$A)), by=1) # change the "by" argument for different increments
df$subset <- findInterval(df$A, subsets)
tapply(df$B, df$subset, mean)
# 0 1 2
#1.142857 2.000000 4.000000
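If you would rather have the result as a data frame with the two new columns from the question (newA = interval index, newB = mean of B), the same grouping column can be fed to aggregate, for example:
res <- aggregate(B ~ subset, data = df, FUN = mean)
names(res) <- c("newA", "newB")
res
#   newA     newB
# 1    0 1.142857
# 2    1 2.000000
# 3    2 4.000000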

Comparing two columns: logical- is value from column 1 also in column 2?

I'm pretty confused on how to go about this. Say I have two columns in a dataframe: one column is a numerical series in order (x), the other specifies some value from the first column, or -1 (y). These are results from a matching experiment, where the goal is to see if multiple photos were taken of the same individual. In the example below there are 10 photos, but 6 unique individuals. In the y column, the corresponding x is reported if there is a match; y is -1 for no match (might as well be NA). If there are more than 2 photos of an individual, the match number will be the most recent record (photos 1, 5 and 7 are the same individual below). The group is the time period the photo was taken (no matches within a group!). Hopefully I've got this example right:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,2,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
I would like to create a new variable to name the unique individuals, and have a final dataset with a single row per individual (i.e. only have 6 rows instead of 10), that also includes the group information. I.e. if an individual is in all three groups, there could be a value of "111" or if just in the first and last group it would be "101". Any tips?
Thanks for asking about the resulting dataset. I realized my group explanation was bad based on the actual numbers I gave, so I changed the results slightly. Bonus would also be nice to have, but not critical.
name <- c(1,2,3,4,6,8)
group_history <- as.character(c('111','101','100','011','010','001'))
bonus <- as.character(c('1,5,7','2,9','3','4,10','6','8'))
results_I_want <- data.frame(name,group_history,bonus)
My word, more mistakes fixed above...
Using the (updated) example you gave
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,3,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
Use x and y to create a mapping from higher numbers to lower numbers that are the same person. Note that names(mapdown) is a character vector, even though the names are digits.
bottom.df <- DF[DF$y==-1,]
mapdown.df <- DF[DF$y!=-1,]
mapdown <- c(mapdown.df$y, bottom.df$x)
names(mapdown) <- c(mapdown.df$x, bottom.df$x)
We don't know how many steps it might take to get everything down to the lowest number, so we have to use a while loop.
oldx <- DF$x
newx <- mapdown[as.character(oldx)]
while (any(oldx != newx)) {
  oldx <- newx
  newx <- mapdown[as.character(oldx)]
}
The result is the group each photo belongs to, named by the lowest number of that set.
DF$id <- unname(newx)
Getting the group membership is harder. Use reshape2 to convert this into wide format (one column per group), where the column is "1" if the individual appears in that group and "0" if not.
library("reshape2")
wide <- dcast(DF, id ~ group, value.var = "id",
              fun.aggregate = function(x){if(length(x) > 0){"1"}else{"0"}})
Finally, paste these "0"/"1" memberships together to get the grouping variable you described.
wide$grouping = apply(wide[,-1], 1, paste, collapse="")
The result:
> wide
id 1 2 3 grouping
1 1 1 1 1 111
2 2 1 0 0 100
3 3 1 0 1 101
4 4 0 1 1 011
5 6 0 1 0 010
6 8 0 0 1 001
No "bonus" yet.
EDIT:
To get the bonus information, it helps to redo the mapping to keep everything. If you have a lot of cases, this could be slow.
Replace the oldx/newx part with:
iterx <- matrix(DF$x, ncol = 1)
iterx <- cbind(iterx, mapdown[as.character(iterx[, 1])])
while (any(iterx[, ncol(iterx)] != iterx[, ncol(iterx) - 1])) {
  iterx <- cbind(iterx, mapdown[as.character(iterx[, ncol(iterx)])])
}
DF$id <- iterx[,ncol(iterx)]
To generate the bonus data, then you can use
bonus <- tapply(iterx[,1], iterx[,ncol(iterx)], paste, collapse=",")
wide$bonus <- bonus[as.character(wide$id)]
Which gives:
> wide
id 1 2 3 grouping bonus
1 1 1 1 1 111 1,5,7
2 2 1 0 0 100 2
3 3 1 0 1 101 3,9
4 4 0 1 1 011 4,10
5 6 0 1 0 010 6
6 8 0 0 1 001 8
Note this isn't the same as your example output, but I don't think your example output is right (how can you have a grouping_history of "000"?).
EDIT:
Now it agrees.
Another solution for the bonus variable:
f_bonus <- function(data = DF){
  # photos with no match start a new individual
  data_a <- subset(data, y == -1, select = x)
  data_a$pos <- seq(nrow(data_a))
  # photos that match an earlier one get the position of that individual
  data_b <- subset(data, y != -1, select = c(x, y))
  data_b$pos <- match(data_b$y, data_a$x)
  data_t <- rbind(data_a, data_b[-2])
  # collapse the photo numbers belonging to each individual
  data_t <- with(data_t, tapply(x, pos, paste, sep = "", collapse = ","))
  return(data_t)
}
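Called on the example DF from above, this returns one comma-separated string of photo numbers per individual, in the same order as wide$id, so for this data it should give something like:
bonus2 <- f_bonus(DF)
bonus2
#       1       2       3       4       5       6
# "1,5,7"     "2"   "3,9"  "4,10"     "6"     "8"
wide$bonus <- as.vector(bonus2)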
