Problem solved, solution added at bottom of posting!
I'd like to know how to "fill" a data frame by inserting rows in between existing rows (not appending to the end).
My situation is as follows:
I have a data set with about 1700 cases and 650 variables
Certain variables have possible answer categories from 0 to 100 (question was: "How many percent..." -> people could fill in from 0 to 100)
Now I want to show the distribution of one of those variables (let's call it var) in a geom_area().
Problem:
1) I need an X-axis ranging from 0 to 100
2) Not all possible percentage values in var were chosen, for instance I have 30 times the answer "20%", but no answer "19%". For the x-Axis this means, the y-Value at x-position 19 is "0", the y-value at x-position 20 is "30".
To prepare my data (this one variable) for plotting with ggplot, I transformed it via the table function:
dummy <- as.data.frame(table(var))
Now I have a column "Var1" with the answer categories and a column "Freq" with the counts for each answer category.
In total, I have 57 rows, which means that 44 possible answers (values from 0 to 100 percent) were not stated.
Example (of my dataframe), "Var1" contains the given answers, "Freq" the counts:
Var1 Freq
1 0 1
2 1 16
3 2 32
4 3 44
5 4 14
...
15 14 1
16 15 169 # <-- See next row and look at "Var1"
17 17 2 # <-- "16%" was never given as answer
Now my question is: How can I create a new data frame which inserts a row after row 16 (with "Var1"=15) where I can set "Var1" to 16 and "Freq" to 0?
Var1 Freq
...
15 14 1
16 15 169
17 16 0 # <-- This line I like to insert
18 17 2
I've already tried something like this:
dummy_x <- NULL
dummy_y <- NULL
for (k in 0:100) {
  pos <- which(dummy$Var1==k)
  if (!is.null(pos)) {
    dummy_x <- rbind(dummy_x, c(k))
    dummy_y <- rbind(dummy_y, dummy$Freq[pos])
  }
  else {
    dummy_x <- rbind(dummy_x, c(k))
    dummy_y <- rbind(dummy_y, 0)
  }
}
newdataframe <- data.frame(cbind(dummy_x), cbind(dummy_y))
which results in an error: dummy_x has 101 values (from 0 to 100, which is correct), but dummy_y only contains 56 rows?
The result should be plotted like this:
plot(ggplot(newdataframe, aes(x=Var1, y=Freq)) +
geom_area(fill=barcolors, alpha=0.3) +
geom_line() +
labs(title=fragetitel, x=NULL, y=NULL))
Thanks in advance,
Daniel
Solution for this problem
library(ggplot2)
plotFreq <- function(var, ftitle=NULL, fcolor="blue") {
  # create data frame from frequency table of var
  # to get answer categories and counts in separate columns
  dummyf <- as.data.frame(table(var))
  # rename to "x-axis" and "y-axis"
  names(dummyf) <- c("xa", "ya")
  # transform $xa from factor to numeric
  dummyf$xa <- as.numeric(as.character(dummyf$xa))
  # get maximum x-value for graph
  maxval <- max(dummyf$xa)
  # Create a vector of zeros, one slot per value from 0 to maxval
  frq <- rep(0, maxval + 1)
  # Replace the zeros at the positions of the observed values by their counts.
  # R indices start at 1, so the value 0 is stored at position 1 (hence the + 1);
  # the remaining positions keep the zeros that were meant to be inserted
  frq[dummyf$xa + 1] <- dummyf$ya
  # create new data frame
  newdf <- data.frame(var = 0:maxval, frq)
  # print plot
  ggplot(newdf, aes(x=var, y=frq)) +
    # fill area
    geom_area(fill=fcolor, alpha=0.3) +
    # outline
    geom_line() +
    # no additional labels on x- and y-axis
    labs(title=ftitle, x=NULL, y=NULL)
}
I think this is a much simpler solution. Looping is not necessary. The idea is to create a vector the size of the desired result, with all values set to zero, and then replace the appropriate positions with the non-zero counts from the frequency table.
> #Let's create sample data
> set.seed(12345)
> var <- sample(100, replace=TRUE)
>
>
> #Lets create frequency table
> x <- as.data.frame(table(var))
> x$var <- as.numeric(as.character(x$var))
> head(x)
var Freq
1 1 3
2 2 1
3 4 1
4 5 2
5 6 1
6 7 2
> #Create a vector of 0s
> freq <- rep(0, 100)
> #Replace the values in freq for those indices which equal x$var by x$Freq so that remaining
> #indices are ones which you intended to insert
> freq[x$var] <- x$Freq
> head(freq)
[1] 3 1 0 1 2 1
> #cbind data together
> freqdf <- as.data.frame(cbind(var = 1:100, freq))
> head(freqdf)
var freq
1 1 3
2 2 1
3 3 0
4 4 1
5 5 2
6 6 1
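The sample data above only contains values from 1 to 100, so indexing by value works directly. The asker's scale starts at 0, which is not a valid R index. One way around that, sketched here on made-up answers (the `answers` vector and the `filled` data frame are illustrative names, not from the question), is to fix the factor levels before calling table(), so that unobserved values are reported with a count of 0:

```r
# Hypothetical sample answers on a 0-100 scale (not the asker's data)
answers <- c(0, 1, 1, 15, 15, 15, 17, 100)

# Fixing the factor levels to 0:100 makes table() report
# every possible value, with a count of 0 for unobserved ones
counts <- table(factor(answers, levels = 0:100))

filled <- data.frame(
  Var1 = 0:100,
  Freq = as.vector(counts)
)

# Rows around the gap at 16
filled[15:18, ]
```

This should give one row per value from 0 to 100, with Freq equal to 0 at Var1 = 16, without any index arithmetic.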
Try something like this:
insertRowToDF<-function(X,index_after,vector_to_insert){
  stopifnot(length(vector_to_insert) == ncol(X)); # to check valid row to be inserted
  X<-rbind(X[1:index_after,],vector_to_insert,X[(index_after+1):nrow(X),]);
  row.names(X)<-1:nrow(X);
  return (X);
}
You can call it with:
df<-insertRowToDF(df,16,c(16,0)); # inserting the values (16,0) after the 16th row
This is Aditya's code plus some conditions to handle special cases (inserting before the first row or after the last row):
insertRowToDF<-function(X,index_after,vector_to_insert){
  stopifnot(length(vector_to_insert) == ncol(X)); # to check valid row to be inserted
  if (index_after != 0) {
    if (dim(X)[1] != index_after) {
      X <- rbind(X[1:index_after,], vector_to_insert, X[(index_after+1):nrow(X),]);
    } else {
      X <- rbind(X[1:index_after,], vector_to_insert);
    }
  } else {
    if (dim(X)[1] != index_after) {
      X <- rbind(vector_to_insert, X[(1):nrow(X),]);
    } else {
      X <- rbind(vector_to_insert);
    }
  }
  row.names(X)<-1:nrow(X);
  return (X);
}
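The special cases can possibly be avoided altogether by indexing with seq_len(), which yields an empty index instead of counting backwards the way 1:0 does. The following is only a sketch: `insert_row` and its argument names are hypothetical, and it assumes the columns are numeric or character (a factor column such as the Var1 produced by table() would need converting first, otherwise a value outside its levels becomes NA).

```r
# Hypothetical helper (illustrative name, not from the answers above)
insert_row <- function(X, index_after, values) {
  stopifnot(length(values) == ncol(X),
            index_after >= 0, index_after <= nrow(X))
  # Turn the new values into a one-row data frame with matching column names
  new_row <- as.data.frame(as.list(values))
  names(new_row) <- names(X)
  # seq_len(0) is empty, so the first and last positions need no extra branches
  head_idx <- seq_len(index_after)
  tail_idx <- index_after + seq_len(nrow(X) - index_after)
  out <- rbind(X[head_idx, , drop = FALSE], new_row, X[tail_idx, , drop = FALSE])
  row.names(out) <- NULL  # renumber rows 1..n
  out
}

df <- data.frame(Var1 = c(14, 15, 17), Freq = c(1, 169, 2))
middle <- insert_row(df, 2, c(16, 0))  # between 15 and 17
first  <- insert_row(df, 0, c(13, 0))  # before the first row
last   <- insert_row(df, 3, c(18, 0))  # after the last row
```

This relies on the documented behaviour that the data frame method of rbind() drops zero-row arguments, so the empty head or tail piece is simply ignored.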
Related
I am trying to mutate() a 0 or 1 at a specific position in a column. Normally mutate() just mutates the whole column, but I want to check conditions and then place a value at a specific position. I tried to use something like an index. Here is an example: I have values and want to compare them one by one: compare 10 to 16, 16 to 9, and so on. The criterion is: are value 1 and value 2 either both in a or both not in a, or is one in a and the other not? I wrote down an approach, but it seems like mutate does not allow using TaskS[i+1].
Thanks for your help!
Index  Values  TaskS
1      10
2      16      1
3      9       1
4      8       0
a <- c(1:10)
data_time_filter <- mutate(data_time_filter, TaskS = '')
for (i in 1:40){
current <- data_time_filter$Trial_Id[i] %in% a
adjacent <- data_time_filter$Trial_Id[i+1] %in% a
if (current == adjacent){
data_time_filter <- mutate(data_time_filter, TaskS[i+1] = 0)
}
else if (current != adjacent){
data_time_filter <- mutate(data_time_filter, TaskS[i+1] = 1)
}
}
I am not really sure I understand your question correctly, but I will try to help anyway.
My approach uses a user-defined function in combination with sapply. I believe that for mutate to work correctly you need a vector as output, which you won't get from a loop.
So, here is what I did:
# Load dplyr for mutate() and %>%
library(dplyr)
# Recreate df
data_time_filter <- data.frame(
  index = 1:4,
  Values = c(10, 16, 9, 8)
)
# Create filter
ff <- c(1:10)
# Add empty TaskS column
data_time_filter <- data_time_filter %>%
  mutate(TaskS = '')
# Define a function
abc <- function(data, filter){
  l <- length(data)
  sapply(1:l, function(x){
    if(x == 1){
      ""
    } else {
      current <- data[x-1] %in% filter
      adjacent <- data[x] %in% filter
      if(current == adjacent){
        0
      } else {
        1
      }
    }
  })
}
This approach will let you use mutate:
> data_time_filter
index Values TaskS
1 1 10
2 2 16
3 3 9
4 4 8
> data_time_filter %>%
mutate(TaskS = abc(Values, ff))
index Values TaskS
1 1 10
2 2 16 1
3 3 9 1
4 4 8 0
You could even skip making the placeholder TaskS column and create a new one directly:
> data_time_filter %>%
mutate(TskS_new = abc(Values, ff))
index Values TaskS TskS_new
1 1 10
2 2 16 1
3 3 9 1
4 4 8 0
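For what it is worth, the same comparison of each value with its predecessor can probably be done without sapply. This is a base R sketch under the same assumptions as above (the `in_filter` name is made up for illustration); it uses NA rather than an empty string for the first row, so the column stays numeric:

```r
# Same toy data and filter as above
data_time_filter <- data.frame(index = 1:4, Values = c(10, 16, 9, 8))
ff <- 1:10

# TRUE where the value is in the filter: TRUE FALSE TRUE TRUE
in_filter <- data_time_filter$Values %in% ff

# diff() gives -1, 0 or 1 between neighbours; abs() turns any change into 1.
# The first row has no predecessor, so it gets NA.
data_time_filter$TaskS <- c(NA, abs(diff(as.integer(in_filter))))

data_time_filter
```

The resulting TaskS column is NA, 1, 1, 0, which matches the output above apart from the first entry.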
I would like to use the vector:
time.int<-c(1,2,3,4,5) #vector to be used as a "guide"
and the database:
time<-c(1,1,1,1,5,5,5)
value<-c("s","s","s","t","d","d","d")
dat1<- as.data.frame(cbind(time,value))
to create the following vector, which I can then combine with the first vector "time.int" in a second data frame.
freq<-c(4,0,0,0,3) #wished result
This vector is the number of events that belong to each time interval: there are four 1s in "time", so the first value is 4, and so on.
Ideally I would like to generalize it so that I can choose the interval, for example summing the events in "time" over every 3 values of time.int.
EDIT for generalization
time.int<-c(1,2,3,4,5,6)
time<-c(1,1,1,2,5,5,5,6)
value<-c("s","s","s","t", "t","d","d","d")
dat1<- data.frame(time,value)
let's say I want it every 2 seconds (every 2 time.int)
freq<-c(4,0,4) #wished result
or every 3
freq<-c(4,4) #wished result
I know how to do that in Excel, with a pivot table.
Sorry if this is a duplicate; I could not find a fitting question on this website, and I do not even know how to ask this or where to start.
The following will produce vector freq.
freq <- sapply(time.int, function(x) sum(x == time))
freq
[1] 4 0 0 0 3
BTW, don't use the construct as.data.frame(cbind(.)): cbind first builds a matrix, so mixing numbers and strings turns every column into character. Use instead
dat1 <- data.frame(time,value)
In order to generalize the code above to segments of time.int of any length, I believe the following function will do it. Note that since you've changed the data the output for n == 1 is not the same as above.
fun <- function(x, y, n){
inx <- lapply(seq_len(length(x) %/% n), function(m) seq_len(n) + n*(m - 1))
sapply(inx, function(i) sum(y %in% x[i]))
}
freq1 <- fun(time.int, time, 1)
freq1
[1] 3 1 0 0 3 1
freq2 <- fun(time.int, time, 2)
freq2
[1] 4 0 4
freq3 <- fun(time.int, time, 3)
freq3
[1] 4 4
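An alternative that may be easier to follow is to count per guide value first and then sum those counts within blocks. This is a sketch only; `bin_counts` and its argument names are hypothetical. Unlike fun above, it keeps a final partial block when length(time.int) is not a multiple of n, which may or may not be what you want.

```r
time.int <- c(1, 2, 3, 4, 5, 6)
time <- c(1, 1, 1, 2, 5, 5, 5, 6)

# Hypothetical helper: count events per block of n consecutive guide values
bin_counts <- function(guide, events, n) {
  # Events per single guide value; events not found in guide give NA
  # from match() and are ignored by tabulate()
  per_value <- tabulate(match(events, guide), nbins = length(guide))
  # Label each guide position with its block number, then sum within blocks
  block <- ceiling(seq_along(guide) / n)
  as.vector(tapply(per_value, block, sum))
}

bin_counts(time.int, time, 1)  # 3 1 0 0 3 1
bin_counts(time.int, time, 2)  # 4 0 4
bin_counts(time.int, time, 3)  # 4 4
```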
We can use the table function to count the event number and use merge to create a data frame summarizing the information. event_dat is the final output.
# Create example data
time.int <- c(1,2,3,4,5)
time <- c(1,1,1,1,5,5,5)
# Count the event using table and convert to a data frame
event <- as.data.frame(table(time))
# Convert the time.int to a data frame
time_dat <- data.frame(time = time.int)
# Merge the data
event_dat <- merge(time_dat, event, by = "time", all = TRUE)
# Replace NA with 0
event_dat[is.na(event_dat)] <- 0
# See the result
event_dat
time Freq
1 1 4
2 2 0
3 3 0
4 4 0
5 5 3
I am trying to get all the indexes that meet a condition in a column. I've already done this for a single column, like this:
# Get a 10% of samples labeled with a 1
indexPositive = sample(which(datafsign$result == 1), nrow(datafsign) * .1)
Is it possible to do the same operation, vectorized, for any number of columns in one line as well? I imagine that in that case indexPositive would be a list or array with the indexes for each column.
Data
The data frame is as follows:
x y f1 f2 f3 f4
1 76.71655 60.74299 1 1 -1 -1
2 -85.73743 -19.67202 1 1 1 -1
3 75.95698 -27.20154 1 1 1 -1
4 -82.57193 39.30717 1 1 1 -1
5 -45.32161 39.44898 1 1 -1 -1
6 -46.76636 -35.30635 1 1 1 -1
The seed I am using is set.seed(1000000007)
What I want is the set of indexes with value 1. In the case of only one column the result is:
head(indexPositive)
[1] 1398 873 3777 2140 133 3515
Thanks in advance.
Answer
Thanks to @David Arenburg I finally did it. Based on his comment I created this function:
getPercentageOfData <- function(x, condition = 1, percentage = .1){
# Get the percentage of samples that meet condition
#
# Args:
# x: A vector containing the data
# condition: Condition that the data need to satisfy
# percentage: What percentage of samples to get
#
# Returns:
# Indexes of the percentage of the samples that meet the condition
meetCondition = which(x == condition)
sample(meetCondition, length(meetCondition) * percentage)
}
And then I used like this:
# Get a 10% of samples labeled with a 1 in all 4 functions
indexPositive = lapply(datafunctions[3:6], getPercentageOfData)
# Change 1 by -1
datafunctions$f1[indexPositive$f1] = -1
datafunctions$f2[indexPositive$f2] = -1
datafunctions$f3[indexPositive$f3] = -1
datafunctions$f4[indexPositive$f4] = -1
It would be great to also assign the values -1 to each column at once instead of writing 4 lines, but I do not know how.
You can define your function as follows (you can also add replacement as a parameter):
getPercentageOfData <- function(x, condition = 1, percentage = .1, replacement = -1){
meetCondition <- which(x == condition)
replace(x, sample(meetCondition, length(meetCondition) * percentage), replacement)
}
Then select the columns you want to operate on and update datafunctions directly (without creating indexPositive and then manually updating)
cols <- 3:6
datafunctions[cols] <- lapply(datafunctions[cols], getPercentageOfData)
You can of course play around with the function's parameters within lapply, as in (for example)
datafunctions[cols] <- lapply(datafunctions[cols],
getPercentageOfData, percentage = .8, replacement = -100)
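One thing to watch with sample(): when it is given a single number n, it samples from 1:n instead of returning n, so a column with exactly one matching row could have the wrong position replaced. Below is a sketch of a variant that sidesteps this by sampling positions with sample.int(). The name `replace_fraction`, the `fraction` argument and the `toy` data frame are all made up for illustration.

```r
# Hypothetical variant of the function above
replace_fraction <- function(x, condition = 1, fraction = 0.1, replacement = -1) {
  hits <- which(x == condition)
  n_pick <- floor(length(hits) * fraction)
  # Sample positions within hits, so a single hit is never expanded to 1:hit
  picked <- hits[sample.int(length(hits), n_pick)]
  replace(x, picked, replacement)
}

set.seed(1)
toy <- data.frame(f1 = rep(1, 20), f2 = rep(c(1, -1), 10))
# toy[] <- keeps the data frame structure while replacing every column
toy[] <- lapply(toy, replace_fraction, fraction = 0.5)

colSums(toy == -1)
```

With fraction = 0.5, half of the 1s in each column are flipped, so f1 should end up with 10 values of -1 and f2 with 15 (its original 10 plus 5 flipped), whatever the seed.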
I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.
I have the following dataframe
df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))
df1
LOCAT START END CODE1 CODE2
1 1 120 150 1 1
2 2 345 390 1 0
3 3 765 802 0 0
4 4 1045 1120 1 0
5 5 1347 1436 0 -1
6 6 1879 1935 0 -1
This is a sample dataframe. The rows continue until 180000 and the columns are over 400.
What I need to do is create a new dataframe, based on each column, that gives the size of each contiguous run of "1" or "-1" and returns it with the location, size and value.
Something like this for CODE1:
LOCAT SIZE VALUE
1 1 to 2 270 POS
2 4 to 4 75 POS
And like this for CODE2:
LOCAT SIZE VALUE
1 1 to 1 30 POS
2 5 to 6 588 NEG
Unfortunately I still haven't figured out how to do this. I have been trying several lines of code to develop a function that does this automatically, but I keep getting lost or stuck in loops, and nothing seems to work.
Any help would be appreciated.
Thanks in advance
Below is code that gives you the answer in the exact format that you wanted, except that I split your "LOCAT" column into two columns called "Starts" and "Stops". This code will work for your entire data frame; there is no need to replicate it manually for each CODE (CODE1, CODE2, etc.).
It assumes that the only non-CODE columns have the names "LOCAT", "START" and "END".
# need package "plyr"
library("plyr")
# test2 is the example data frame that you gave in the question
test2 <- data.frame(
  "LOCAT"=1:6,
  "START"=c(120,345,765, 1045, 1347, 1879),
  "END"=c(150,390,802,1120,1436, 1935),
  "CODE1"=c(1,1,0,1,0,0),
  "CODE2"=c(1,0,0,0,-1,-1)
)
codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column
sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
  rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
  stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
  starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
  ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) # the starts and stops from X (ss from X)
  ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows that correspond to CODE_ values that are 0 (not POS or NEG)
  # The next 3 lines calculate the equivalent of your SIZE column
  sizeX1 <- x[ssX[,"Starts"],"START"]
  sizeX2 <- x[ssX[,"Stops"],"END"]
  sizeX <- sizeX2 - sizeX1
  sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size with the ssX (start stop of X) data frame
  return(sssX) # Added in EDIT
}
answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use the function ddply() in the package "plyr" (apply the function to each CODE, why we reshaped)
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
Hopefully this helps, good luck!
Use run-length encoding to determine groups where CODE1 takes the same value.
rle_of_CODE1 <- rle(df1$CODE1)
For convenience, find the points where the value is non-zero, and the lengths of the corresponding blocks.
CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]
Ignore the parts of df1 where CODE1 is zero.
df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)
Define a group based on the contiguous blocks we found with rle.
df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
Use ddply (from the plyr package, so load it with library(plyr) first) to get summary stats for each group.
summarised_by_CODE1 <- ddply(
df1_with_nonzero_CODE1,
.(GROUP),
summarise,
MinOfLOCAT = min(LOCAT),
MaxOfLOCAT = max(LOCAT),
SIZE = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
rle_of_CODE1$values[CODE1_is_nonzero] == 1,
"POS",
"NEG"
)
summarised_by_CODE1
## GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1 1 1 2 270 POS
## 2 3 4 4 75 POS
Now repeat with CODE2.
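Rather than repeating the steps by hand for 400 columns, the run-length logic could be wrapped in a function and applied to every CODE column. This is a base R sketch on the question's sample data; `summarise_runs`, the FROM/TO column names and the assumption that all code columns start with "CODE" are mine, not part of the answers above.

```r
df1 <- data.frame(
  LOCAT = 1:6,
  START = c(120, 345, 765, 1045, 1347, 1879),
  END   = c(150, 390, 802, 1120, 1436, 1935),
  CODE1 = c(1, 1, 0, 1, 0, 0),
  CODE2 = c(1, 0, 0, 0, -1, -1)
)

# Hypothetical helper: summarise runs of non-zero values in one CODE column
summarise_runs <- function(data, code_col) {
  runs   <- rle(data[[code_col]])
  stops  <- cumsum(runs$lengths)        # last row of each run
  starts <- stops - runs$lengths + 1    # first row of each run
  keep   <- runs$values != 0            # drop the runs of zeros
  data.frame(
    CODE  = rep(code_col, sum(keep)),
    FROM  = data$LOCAT[starts[keep]],
    TO    = data$LOCAT[stops[keep]],
    SIZE  = data$END[stops[keep]] - data$START[starts[keep]],
    VALUE = ifelse(runs$values[keep] == 1, "POS", "NEG")
  )
}

# Assumes every code column name starts with "CODE"
code_cols <- grep("^CODE", names(df1), value = TRUE)
result <- do.call(rbind, lapply(code_cols, summarise_runs, data = df1))
result
```

On the sample data this should reproduce the sizes asked for in the question: 270 and 75 for CODE1, 30 and 588 for CODE2. For 180000 rows and 400 columns, collecting the pieces in a list and binding once, as done here, is likely cheaper than growing a data frame inside a loop.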
I'm pretty confused about how to go about this. Say I have two columns in a dataframe. One column is a numerical series in order (x), the other specifies some value from the first, or -1 (y). These are results from a matching experiment, where the goal is to see if multiple photos were taken of the same individual. In the example below, there are 10 photos, but only 6 unique individuals. In the y column, the corresponding x is reported if there is a match; y is -1 for no match (these might as well be NAs). If there are more than 2 photos per individual, the match # will be the most recent record (photos 1, 5 and 7 are the same individual below). The group is the time period in which the photo was taken (no matches within a group!). Hopefully I've got this example right:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,2,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
I would like to create a new variable to name the unique individuals, and have a final dataset with a single row per individual (i.e. only have 6 rows instead of 10), that also includes the group information. I.e. if an individual is in all three groups, there could be a value of "111" or if just in the first and last group it would be "101". Any tips?
Thanks for asking about the resulting dataset. I realized my group explanation was bad based on the actual numbers I gave, so I changed the results slightly. Bonus would also be nice to have, but not critical.
name <- c(1,2,3,4,6,8)
group_history <- as.character(c('111','101','100','011','010','001'))
bonus <- as.character(c('1,5,7','2,9','3','4,10','6','8'))
results_I_want <- data.frame(name,group_history,bonus)
My word, more mistakes fixed above...
Using the (updated) example you gave
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,3,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
Use x and y to create a mapping from higher numbers to the lower numbers that are the same person. Note that the names are strings, even though they consist of digits.
bottom.df <- DF[DF$y==-1,]
mapdown.df <- DF[DF$y!=-1,]
mapdown <- c(mapdown.df$y, bottom.df$x)
names(mapdown) <- c(mapdown.df$x, bottom.df$x)
We don't know how many steps it might take to get everything down to the lowest number, so we have to use a while loop.
oldx <- DF$x
newx <- mapdown[as.character(oldx)]
while(any(oldx != newx)) {
oldx = newx
newx = mapdown[as.character(oldx)]
}
The result is the set each photo belongs to, named by the lowest number in that set.
DF$id <- unname(newx)
Getting the group membership is harder. Using reshape2 to convert this into wide format (one column per group) where the column is "1" if there was something in that one and "0" if not.
library("reshape2")
wide <- dcast(DF, id~group, value.var="id",
fun.aggregate=function(x){if(length(x)>0){"1"}else{"0"}})
Finally, paste these "0"/"1" memberships together to get the grouping variable you described.
wide$grouping = apply(wide[,-1], 1, paste, collapse="")
The result:
> wide
id 1 2 3 grouping
1 1 1 1 1 111
2 2 1 0 0 100
3 3 1 0 1 101
4 4 0 1 1 011
5 6 0 1 0 010
6 8 0 0 1 001
No "bonus" yet.
EDIT:
To get the bonus information, it helps to redo the mapping to keep everything. If you have a lot of cases, this could be slow.
Replace the oldx/newx part with:
iterx <- matrix(DF$x, ncol=1)
iterx <- cbind(iterx, mapdown[as.character(iterx[,1])])
while(any(iterx[,ncol(iterx)]!=iterx[,ncol(iterx)-1])) {
iterx <- cbind(iterx, mapdown[as.character(iterx[,ncol(iterx)])])
}
DF$id <- iterx[,ncol(iterx)]
To generate the bonus data, then you can use
bonus <- tapply(iterx[,1], iterx[,ncol(iterx)], paste, collapse=",")
wide$bonus <- bonus[as.character(wide$id)]
Which gives:
> wide
id 1 2 3 grouping bonus
1 1 1 1 1 111 1,5,7
2 2 1 0 0 100 2
3 3 1 0 1 101 3,9
4 4 0 1 1 011 4,10
5 6 0 1 0 010 6
6 8 0 0 1 001 8
Note this isn't same as your example output, but I don't think your example output is right (how can you have a grouping_history of "000"?)
EDIT:
Now it agrees.
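Another solution for the bonus variable:
f_bonus <- function(data=DF){
  # photos with no earlier match start a new individual; number them in order
  data_a <- subset(data,y== -1,select=x)
  data_a$pos <- seq(nrow(data_a))
  # matched photos take the position of the individual they point to
  data_b <- subset(data,y!= -1,select=c(x,y))
  data_b$pos <- match(data_b$y, data_a$x)
  # stack both sets (dropping y) and paste the photo numbers per individual
  data_t <- rbind(data_a,data_b[-2])
  data_t <- with(data_t,tapply(x,pos,paste,sep="",collapse=","))
  return(data_t)
}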
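For completeness, here is a base R sketch that produces the id, the group history and the bonus column in one go, without reshape2. It uses the updated example from the first answer (y[9] is 3) and assumes that following the matches always ends at a photo with y equal to -1. The variable names `id`, `membership` and `result` are illustrative.

```r
x <- 1:10
y <- c(-1, -1, -1, -1, 1, -1, 1, -1, 3, 4)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)

# Follow each match back until reaching a photo with no earlier match (-1)
id <- x
repeat {
  parent <- y[match(id, x)]
  step <- parent != -1
  if (!any(step)) break
  id[step] <- parent[step]
}

# TRUE where an individual was photographed in a group
membership <- table(id, group) > 0
group_history <- apply(membership, 1,
                       function(r) paste(as.integer(r), collapse = ""))
# All photo numbers belonging to each individual
bonus <- tapply(x, id, paste, collapse = ",")

result <- data.frame(
  name = as.numeric(names(group_history)),
  group_history = unname(group_history),
  bonus = as.vector(bonus)
)
result
```

On this example the result has six rows, with group histories 111, 100, 101, 011, 010, 001 and bonus values "1,5,7", "2", "3,9", "4,10", "6", "8", which agrees with the wide table shown earlier.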