looping between two vectors for meeting three conditions - r

I have a csv file containing 4 columns of data. I need to select the first column from the csv file which I do like this:
file1<-read.csv("file1.csv",header=TRUE)
x<-file1[,1]
The first column (x, here) contains row numbers.
x
5
10
54
177
178
182
183
184
185
203
204
205
206
207
208
Now there is another csv file which contains a single column of 365 rows of data
y<-read.csv("data.csv",header=TRUE)
y
0
2.3
0.5
21
0
.
.
.
9.5 #total 365 numbers
This is what I intend to do:
1) From x, choose the first number (which is 5)
2) In y, select the corresponding 5th data point (which is 0) and the 4 data points before it (which are 21, 0.5, 2.3, 0), then test the following conditions in turn
Condition 1: If three of the five data points are > 0, then print 5 (the result of step 1)
Condition 2: If four of the five are > 0, then print 5 again
Condition 3: If all five are > 0, then print 5 again
However, if only the first two conditions are met and the third is not, then select the second number from x (10 in this case), take the corresponding 10th data point in y and the four data points before it (the 6th, 7th, 8th and 9th), and evaluate them for the third condition only (i.e. whether all five numbers, the 6th through the 10th, are > 0; the first and second conditions were already met by the previous number from x). If the third condition is met, save 10 and stop.
This sounds quite complicated for a feeble mind of mine (as seen by my reputation) and was hoping someone can tell me how to do this in R.
Thanks a lot

Sounds like you need a while loop.
file1 <- data.frame(x=seq(5, 205, by=5))
file1
x <- file1[, 1]
set.seed(123)
file2 <- data.frame(y=rnorm(365))
y <- file2[, 1]
# flags for each condition
cond1 <- FALSE
cond2 <- FALSE
cond3 <- FALSE
k <- 0
# stop once condition 3 is met, or when x is exhausted
while (!cond3 && k < length(x)) {
  k <- k + 1
  # select the next number from x
  num <- x[k]
  # select the data point and the 4 points before it
  all.y <- y[(num-4):num]
  # number of positive values
  chk.pos <- sum(all.y > 0)
  # condition 1: check if 3 of 5 are positive
  if (!cond1 && chk.pos >= 3) {
    cond1 <- TRUE
    print(num)
  }
  # condition 2: check if 4 of 5 are positive
  if (!cond2 && chk.pos >= 4) {
    cond2 <- TRUE
    print(num)
  }
  # condition 3: check if 5 of 5 are positive
  if (!cond3 && chk.pos == 5) {
    cond3 <- TRUE
    print(num)
  }
}
for me returns
[1] 5
[1] 15
[1] 70
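If you prefer to avoid the flags, the same search can probably be written without an explicit loop. This is only a sketch: `count_pos` and `first_hits` are names I made up, and it assumes every number in x is at least 5 so that the window never runs off the start of y.

```r
# Hypothetical helper: for each index in idx, count the positive values
# in the window of `width` points ending at that index.
# Assumes every index is at least `width`, so the window stays inside y.
count_pos <- function(y, idx, width = 5) {
  sapply(idx, function(i) sum(y[(i - width + 1):i] > 0))
}

# Hypothetical helper: first element of idx whose window holds at least
# 3, 4 and 5 positive values (NA when no window qualifies).
first_hits <- function(y, idx) {
  pos <- count_pos(y, idx)
  sapply(3:5, function(k) idx[which(pos >= k)[1]])
}

x <- seq(5, 205, by = 5)
set.seed(123)
y <- rnorm(365)
first_hits(y, x)
```

The result has one entry per condition, in the same order as the numbers printed by the while loop.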

Related

Find row that matches a range of values

I am trying to get the column importantval for a number that is within a range. I have no clue how to even start this, anyone have any ideas?
data<-data.frame(lower=c(1,4,6,7,7),upper=c(3,5,7,8,9),importantval=c(99,98,97,96,95))
vals<-c(1.14,3.5,7.2,19)
> data
  lower upper importantval
1     1     3           99
2     4     5           98
3     6     7           97
4     7     8           96
5     7     9           95
output goal
# 1.14 99
# 3.5 NA
# 7.2 96 <--return the smaller interval (from 7 to 8 is smaller than 7 to 9)
# 19 NA <--doesn't exist so return NA
A simple sapply (or lapply) would do the trick. Identifying the matching rows is relatively easy. The if statement that keeps only the smaller interval is a bit harder to follow: if more than one interval contains the value, I take the row whose interval width equals the smallest width among the matches.
foo <- function(i) {
  res <- data[data$lower < i & data$upper > i, ]
  if (nrow(res) > 1) {
    res <- res[which(res$upper - res$lower == min(res$upper - res$lower)), ]
  }
  if (nrow(res) == 0) return(NA)
  return(res$importantval)
}
results <- data.frame(vals, sapply(vals, foo))
This assumes that no two matching intervals have the same length. If that is a possibility, you could use return(min(res$importantval)) at the end to get only the smaller value.
If you would want to keep both values, take the results in a list:
results <- lapply(vals, foo)
names(results) <- vals
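For reference, here is a self-contained sketch of the same idea run against the sample data from the question. `lookup_narrowest` is a made-up name, and resolving ties with `which.min()` (first match wins) is my assumption, not something the question specifies.

```r
data <- data.frame(lower = c(1, 4, 6, 7, 7),
                   upper = c(3, 5, 7, 8, 9),
                   importantval = c(99, 98, 97, 96, 95))
vals <- c(1.14, 3.5, 7.2, 19)

# Hypothetical helper: value of the narrowest interval that strictly
# contains v, or NA when no interval does.
lookup_narrowest <- function(v, intervals) {
  hit <- intervals[intervals$lower < v & intervals$upper > v, ]
  if (nrow(hit) == 0) return(NA_real_)
  # which.min() returns the first minimum, so ties resolve to the earliest row
  hit$importantval[which.min(hit$upper - hit$lower)]
}

results <- data.frame(vals,
                      importantval = sapply(vals, lookup_narrowest, intervals = data))
results
```

For the sample data this gives 99, NA, 96 and NA, which matches the output goal in the question.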

Create a new data frame based on another dataframe

I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.
I have the following dataframe
df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))
df1
  LOCAT START  END CODE1 CODE2
1     1   120  150     1     1
2     2   345  390     1     0
3     3   765  802     0     0
4     4  1045 1120     1     0
5     5  1347 1436     0    -1
6     6  1879 1935     0    -1
This is a sample dataframe. The rows continue until 180000 and the columns are over 400.
What I need to do is create a new dataframe for each column that tells me the size of each continuous run of "1" or "-1" and returns it with the location, size and value.
Something like this for CODE1:
   LOCAT SIZE VALUE
1 1 to 2  270   POS
2 4 to 4   75   POS
And like this for CODE2:
   LOCAT SIZE VALUE
1 1 to 1   30   POS
2 5 to 6  588   NEG
Unfortunately I still didn't figure out how to do this. I have been trying several lines of code to develop a function to do this automatically but start to get lost or stuck in loops and it seems that nothing works.
Any help would be appreciated.
Thanks in advance
Below is code that gives you the answer in the exact format that you wanted, except I split your "LOCAT" column into two columns entitled "Starts" and "Stops". This code will work for your entire data frame, no need to replicate it manually for each CODE (CODE1, CODE2, etc).
It assumes that the only non-CODE columns are named "LOCAT", "START" and "END".
# need package "plyr"
library("plyr")
# test2 is the example data frame that you gave in the question
test2 <- data.frame(
  "LOCAT"=1:6,
  "START"=c(120,345,765, 1045, 1347, 1879),
  "END"=c(150,390,802,1120,1436, 1935),
  "CODE1"=c(1,1,0,1,0,0),
  "CODE2"=c(1,0,0,0,-1,-1)
)
codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column
sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
  rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
  stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
  starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
  ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) # the starts and stops from X (ss from X)
  ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows that correspond to CODE_ values that are 0 (not POS or NEG)
  # The next 3 lines calculate the equivalent of your SIZE column
  sizeX1 <- x[ssX[,"Starts"],"START"]
  sizeX2 <- x[ssX[,"Stops"],"END"]
  sizeX <- sizeX2 - sizeX1
  sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size with the ssX (start stop of X) data frame
  return(sssX) #Added in EDIT
}
answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use the function ddply() in the package "plyr" (apply the function to each CODE, why we reshaped)
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
Hopefully this helps, good luck!
Use run-length encoding to determine groups where CODE1 takes the same value.
rle_of_CODE1 <- rle(df1$CODE1)
For convenience, find the points where the value is non-zero, and the lengths of the corresponding blocks.
CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]
Ignore the parts of df1 where CODE1 is zero.
df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)
Define a group based on the contiguous blocks we found with rle.
df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
Use ddply to get summary stats for each group.
# ddply() and .() come from the plyr package
library(plyr)
summarised_by_CODE1 <- ddply(
  df1_with_nonzero_CODE1,
  .(GROUP),
  summarise,
  MinOfLOCAT = min(LOCAT),
  MaxOfLOCAT = max(LOCAT),
  SIZE = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
  rle_of_CODE1$values[CODE1_is_nonzero] == 1,
  "POS",
  "NEG"
)
summarised_by_CODE1
##   GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1     1          1          2  270   POS
## 2     2          4          4   75   POS
Now repeat with CODE2.
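Rather than repeating the steps by hand for 400 columns, the rle idea can be wrapped in a function and applied to every code column. The following is a rough base-R sketch, not a tuned solution: `summarise_runs` and the output column names are my own, and it assumes the rows are already sorted by position so that a run spans from its first START to its last END.

```r
df1 <- data.frame(LOCAT = c(1, 2, 3, 4, 5, 6),
                  START = c(120, 345, 765, 1045, 1347, 1879),
                  END   = c(150, 390, 802, 1120, 1436, 1935),
                  CODE1 = c(1, 1, 0, 1, 0, 0),
                  CODE2 = c(1, 0, 0, 0, -1, -1))

# Hypothetical helper: one row per run of non-zero values in one code column
summarise_runs <- function(df, code_col) {
  r <- rle(df[[code_col]])
  stops <- cumsum(r$lengths)        # last row of each run
  starts <- stops - r$lengths + 1   # first row of each run
  keep <- r$values != 0             # drop the runs of zeros
  data.frame(
    CODE = rep(code_col, sum(keep)),
    FirstLOCAT = df$LOCAT[starts[keep]],
    LastLOCAT = df$LOCAT[stops[keep]],
    SIZE = df$END[stops[keep]] - df$START[starts[keep]],
    VALUE = ifelse(r$values[keep] == 1, "POS", "NEG")
  )
}

# every column that is not LOCAT, START or END is treated as a code column
code_cols <- setdiff(names(df1), c("LOCAT", "START", "END"))
result <- do.call(rbind, lapply(code_cols, summarise_runs, df = df1))
result
```

On the sample data the SIZE column comes out as 270 and 75 for CODE1 and 30 and 588 for CODE2, the same numbers as in the question.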

Insert "empty" rows (filling up) in data frame in R

Problem solved, solution added at bottom of posting!
I'd like to know how to "fill" a data frame by inserting rows in between existing rows (not appending to the end).
My situation is following:
I have a data set with about 1700 cases and 650 variables
Certain variables have possible answer categories from 0 to 100 (question was: "How many percent..." -> people could fill in from 0 to 100)
Now I want to show the distribution of one of those variables (let's call it var) in a geom_area().
Problem:
1) I need an X-axis ranging from 0 to 100
2) Not all possible percentage values in var were chosen, for instance I have 30 times the answer "20%", but no answer "19%". For the x-Axis this means, the y-Value at x-position 19 is "0", the y-value at x-position 20 is "30".
To prepare my data (this one variable) for plotting it with ggplot, I transformed it via the table function:
dummy <- as.data.frame(table(var))
Now I have a column "Var1" with the answer categories and a column "Freq" with the counts of each answer category.
In total, I have 57 rows, which means that 44 possible answers (values from 0 to 100 percent) were not stated.
Example (of my dataframe), "Var1" contains the given answers, "Freq" the counts:
   Var1 Freq
1     0    1
2     1   16
3     2   32
4     3   44
5     4   14
...
15   14    1
16   15  169 # <-- See next row and look at "Var1"
17   17    2 # <-- "16%" was never given as answer
Now my question is: How can I create a new data frame which inserts a row after row 16 (with "Var1"=15) where I can set "Var1" to 16 and "Freq" to 0?
   Var1 Freq
...
15   14    1
16   15  169
17   16    0 # <-- This line I like to insert
18   17    2
I've already tried something like this:
dummy_x <- NULL
dummy_y <- NULL
for (k in 0:100) {
  pos <- which(dummy$Var1==k)
  if (!is.null(pos)) {
    dummy_x <- rbind(dummy_x, c(k))
    dummy_y <- rbind(dummy_y, dummy$Freq[pos])
  }
  else {
    dummy_x <- rbind(dummy_x, c(k))
    dummy_y <- rbind(dummy_y, 0)
  }
}
newdataframe <- data.frame(cbind(dummy_x), cbind(dummy_y))
which results in an error: dummy_x has 101 values (from 0 to 100, correct), but dummy_y only contains 56 rows?
The result should be plotted like this:
plot(ggplot(newdataframe, aes(x=Var1, y=Freq)) +
  geom_area(fill=barcolors, alpha=0.3) +
  geom_line() +
  labs(title=fragetitel, x=NULL, y=NULL))
Thanks in advance,
Daniel
Solution for this problem
library(ggplot2)
plotFreq <- function(var, ftitle=NULL, fcolor="blue") {
  # create data frame from frequency table of var
  # to get answer categories and counts in separate columns
  dummyf <- as.data.frame(table(var))
  # rename to "x-axis" and "y-axis"
  names(dummyf) <- c("xa", "ya")
  # transform $xa from factor to numeric
  dummyf$xa <- as.numeric(as.character(dummyf$xa))
  # get maximum x-value for graph
  maxval <- max(dummyf$xa)
  # create a vector of zeros, one slot for each value from 0 to maxval
  frq <- rep(0, maxval + 1)
  # fill in the observed counts; the +1 offset is needed because the
  # answers start at 0 while R indices start at 1. The slots left at
  # zero are the rows that were meant to be inserted
  frq[dummyf$xa + 1] <- dummyf$ya
  # create new data frame
  newdf <- data.frame(var = 0:maxval, frq)
  # print plot
  ggplot(newdf, aes(x=var, y=frq)) +
    # fill area
    geom_area(fill=fcolor, alpha=0.3) +
    # outline
    geom_line() +
    # no additional labels on x- and y-axis
    labs(title=ftitle, x=NULL, y=NULL)
}
I think this is a much simpler solution. Looping is not necessary. The idea is to create a vector the size of the desired result, with all values set to zero, and then replace the appropriate entries with the non-zero counts from the frequency table.
> #Let's create sample data
> set.seed(12345)
> var <- sample(100, replace=TRUE)
>
>
> #Lets create frequency table
> x <- as.data.frame(table(var))
> x$var <- as.numeric(as.character(x$var))
> head(x)
var Freq
1 1 3
2 2 1
3 4 1
4 5 2
5 6 1
6 7 2
> #Create a vector of 0s
> freq <- rep(0, 100)
> #Replace the values in freq for those indices which equal x$var by x$Freq so that remaining
> #indices are ones which you intended to insert
> freq[x$var] <- x$Freq
> head(freq)
[1] 3 1 0 1 2 1
> #cbind data together
> freqdf <- as.data.frame(cbind(var = 1:100, freq))
> head(freqdf)
var freq
1 1 3
2 2 1
3 3 0
4 4 1
5 5 2
6 6 1
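Another option, sketched here with made-up sample data, is to let table() produce the zero counts itself: if the variable is converted to a factor whose levels cover the whole range, every level gets a row even when it never occurs. The 0:100 range is taken from the question; the variable names are mine.

```r
# small made-up sample: 16 never occurs
var <- c(0, 1, 1, 15, 15, 15, 17)

# a factor with explicit levels makes table() count the missing values as 0
counts <- table(factor(var, levels = 0:100))

freqdf <- data.frame(var = 0:100, freq = as.vector(counts))
freqdf[15:19, ]
```

This also copes with answers of 0, which plain index assignment does not, because R vectors are indexed from 1.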
Try something like this:
insertRowToDF<-function(X,index_after,vector_to_insert){
  stopifnot(length(vector_to_insert) == ncol(X)); # to check valid row to be inserted
  X<-rbind(X[1:index_after,],vector_to_insert,X[(index_after+1):nrow(X),]);
  row.names(X)<-1:nrow(X);
  return (X);
}
You can call it with:
df<-insertRowToDF(df,16,c(16,0)); # inserting the values (16,0) after the 16th row
This is Aditya's code plus some conditions to handle the special cases (inserting before the first row or after the last row):
insertRowToDF<-function(X,index_after,vector_to_insert){
  stopifnot(length(vector_to_insert) == ncol(X)); # to check valid row to be inserted
  if (index_after != 0) {
    if (dim(X)[1] != index_after) {
      X <- rbind(X[1:index_after,], vector_to_insert, X[(index_after+1):nrow(X),]);
    } else {
      X <- rbind(X[1:index_after,], vector_to_insert);
    }
  } else {
    if (dim(X)[1] != index_after) {
      X <- rbind(vector_to_insert, X[(1):nrow(X),]);
    } else {
      X <- rbind(vector_to_insert);
    }
  }
  row.names(X)<-1:nrow(X);
  return (X);
}
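The special cases can probably be avoided altogether by indexing with seq_len(), which yields an empty selection instead of counting backwards. A sketch under that assumption (`insert_row` is a hypothetical name, and the new row is turned into a one-row data frame so that rbind() only ever sees data frames):

```r
# Hypothetical variant: insert new_row after row index_after (0 = at the top)
insert_row <- function(X, index_after, new_row) {
  stopifnot(length(new_row) == ncol(X),
            index_after >= 0, index_after <= nrow(X))
  # seq_len(0) is empty, so top or bottom may have zero rows
  top <- X[seq_len(index_after), , drop = FALSE]
  bottom <- X[index_after + seq_len(nrow(X) - index_after), , drop = FALSE]
  new_df <- as.data.frame(as.list(new_row))
  names(new_df) <- names(X)
  out <- rbind(top, new_df, bottom)
  row.names(out) <- NULL   # renumber rows 1..n
  out
}

df <- data.frame(Var1 = c(14, 15, 17), Freq = c(1, 169, 2))
insert_row(df, 2, c(16, 0))
```

Here the row (16, 0) ends up between the rows for 15 and 17.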

How to count and test for the sum and repeat the action

I need to test the value of 'peso' (see replication code below) for each factor. If a factor reaches 50% of the overall sum of 'peso', the values of each factor should be pasted into a new object 'results'. Otherwise, R should find which factor has the lowest aggregated value of 'peso', and for those rows use the factor in the next column when aggregating 'peso' again. Basically, this process replaces the lowest-scoring factor with the next factor. The process should repeat until a factor crosses the 50% threshold. So my question is, where do I start?
set.seed(51)
Data <- sapply(1:100, function(x) sample(1:10, size=5))
Data <- data.frame(t(Data))
names(Data) <- letters[1:5]
Data$peso <- sample(0:3.5, 100, rep=TRUE)
It should be like
If your first few rows are:
 a b c  d e peso
 8 2 3  7 9    1
 8 3 4  5 7    3
 9 7 4 10 1    2
10 3 4  5 7    3
What would you like for the total?
Totals_08 = 4
Totals_09 = 2
Totals_10 = 3
etc?
So factor 8 got the greatest share, 4/(4+2+3) = 0.4444444, but did not reach the 50% threshold in round 'a'. Therefore I need something more: repeat the aggregation, but now considering the factor 7 in column 'b' instead of the factor 9 in column 'a', since 9 got the lowest aggregated value in the first round.
It's unclear if you have your list of factors already or not. If you do not have it, and are taking it from the data set, you can grab it in a few different ways:
# Get a list of all the factors
myFactors <- levels(Data[[1]]) # If actual factors.
myFactors <- sort(unique(unlist(Data))) # Otherwise use similar to this line
Then to calculate the Totals per factor, you can do the following
Totals <-
  colSums(sapply(myFactors, function(fctr)
    # calculate totals per fctr
    as.integer(Data$peso) * rowSums(fctr == subset(Data, select= -peso))
  ))
names(Totals) <- myFactors
Which gives
Totals
# 1 2 3 4 5 6 7 8 9 10
# 132 153 142 122 103 135 118 144 148 128
Next:
I'm not sure if afterwards, you want to compare to the sum of peso or the sum of the totals. Here are both options, broken down into steps:
# Calculate the total of all the Totals:
TotalSum <- sum(Totals)
# See percentage for each:
Totals / TotalSum
Totals / sum(as.integer(Data$peso))
# See which, if any, is greater than 50%
Totals / TotalSum > 0.50
Totals / sum(as.integer(Data$peso)) > 0.50
# Using Which to identify the ones you are looking for
which(Totals / TotalSum > 0.50)
which(Totals / sum(as.integer(Data$peso)) > 0.50)
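If instead you want the per-column rounds described in the question, the first round (column 'a' only) might look like the sketch below. It uses just the four example rows from the question; `ex`, `totals` and `share` are names I chose, and the replacement step for later rounds is left out because the question does not pin it down.

```r
# The four example rows from the question (column 'a' and 'peso' only)
ex <- data.frame(a = c(8, 8, 9, 10), peso = c(1, 3, 2, 3))

# sum of peso per factor in column 'a'
totals <- tapply(ex$peso, ex$a, sum)

# share of each factor in the round total
share <- totals / sum(totals)

any(share >= 0.5)         # did any factor reach the threshold?
names(which.min(totals))  # factor with the lowest total in this round
```

For these rows the totals are 4, 2 and 3, no factor reaches 50%, and the lowest-scoring factor is 9, as in the question.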
Note on your sampling for Peso
You took a sample of 0:3.5; however, x:y steps by 1 starting from x, so 0:3.5 only gives the integers 0, 1, 2, 3.
If you want fractions, you can either use seq() or you can take a larger sequence and then divide appropriately:
option1 <- (0:7) / 2
option2 <- seq(from=0, to=3.5, by=0.5)
If you want whole integers from 0:3 and also the value 3.5, then use c()
option3 <- c(0:3, 3.5)
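A quick check of the point above, followed by one way the sampling line could look if half steps were intended. This is a sketch; whether half steps are what you actually want is an assumption on my part.

```r
# 0:3.5 steps by 1 from 0 and stops before exceeding 3.5
all(0:3.5 == 0:3)

# assumed intent: weights from 0 to 3.5 in steps of 0.5
half_steps <- seq(from = 0, to = 3.5, by = 0.5)
set.seed(51)
peso <- sample(half_steps, 100, replace = TRUE)
table(peso)
```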

R - Filter Data from a data frame

I am new to R and really unsure how to filter data in a data frame.
I have created a data frame with two columns: a monthly date and the corresponding temperature. It has 324 rows.
> head(Nino3.4_1974_2000)
Month_common Nino3.4_degree_1974_2000_plain
1 1974-01-15 -1.93025
2 1974-02-15 -1.73535
3 1974-03-15 -1.20040
4 1974-04-15 -1.00390
5 1974-05-15 -0.62550
6 1974-06-15 -0.36915
The filter rule is to select the temperatures that are greater than or equal to 0.5 degrees. In addition, they have to last for at least 5 consecutive months.
I have eliminated the data with temperatures below 0.5 degrees (see below).
for (i in 1) {
  el_nino=Nino3.4_1974_2000[which(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain >= 0.5),]
}
> head(el_nino)
Month_common Nino3.4_degree_1974_2000_plain
32 1976-08-15 0.5192000
33 1976-09-15 0.8740000
34 1976-10-15 0.8864501
35 1976-11-15 0.8229501
36 1976-12-15 0.7336500
37 1977-01-15 0.9276500
However, I still need to extract the runs of at least 5 consecutive months. I hope someone can help me out.
If you can always rely on the spacing being one month, then let's temporarily discard the time information:
temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
So, since every temperature in that vector is always separated by one month, we just have to look for runs where the temps[i]>=0.5, and the run has to be at least 5 long.
If we do the following:
ofinterest <- temps >= 0.5
we'll have a vector ofinterest with values TRUE FALSE FALSE TRUE TRUE .... etc where it's TRUE when temps[i] was >= 0.5 and FALSE otherwise.
To rephrase your problem then, we just need to look for occurrences of at least five TRUE in a row.
To do this we can use the function rle. ?rle gives:
> ?rle
Description
Compute the lengths and values of runs of equal values in a vector
- or the reverse operation.
Value:
‘rle()’ returns an object of class ‘"rle"’ which is a list with
components:
lengths: an integer vector containing the length of each run.
values: a vector of the same length as ‘lengths’ with the
corresponding values.
So we use rle which counts up all the streaks of consecutive TRUE in a row and consecutive FALSE in a row, and look for at least 5 TRUE in a row.
I'll just make up some data to demonstrate:
# for you, temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
temps <- runif(1000)
# make a vector that is TRUE when temperature is >= 0.5 and FALSE otherwise
ofinterest <- temps >= 0.5
# count up the runs of TRUEs and FALSEs using rle:
runs <- rle(ofinterest)
# we need to find runs where runs$lengths >= 5 (i.e. at least 5 in a row),
# AND runs$values is TRUE (so at least 5 'TRUE's in a row).
streakIs <- which(runs$lengths>=5 & runs$values)
# these are all the el_nino occurrences.
# We need to convert `streakIs` into indices into our original `temps` vector.
# To do this we add up all the `runs$lengths` before `streakIs[i]` and add 1;
# that gives the index into `temps`.
# that is:
# startMonths <- c()
# for ( n in streakIs ) {
#   startMonths <- c(startMonths, sum(runs$lengths[seq_len(n-1)]) + 1)
# }
#
# However, since this is R we can do the same more compactly with sapply.
# seq_len(n-1) is used instead of 1:(n-1) so that a streak in the very
# first run (n = 1) correctly starts at index 1:
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1)
Now if you do Nino3.4_1974_2000$Month_common[startMonths] you'll get all the months in which the El Nino started.
It boils down to just a few lines:
runs <- rle(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain>=0.5)
streakIs <- which(runs$lengths>=5 & runs$values)
startMonths <- sapply(streakIs, function(n) sum(runs$lengths[seq_len(n-1)])+1) # seq_len() also handles a streak in the first run
Nino3.4_1974_2000$Month_common[startMonths]
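If you want every row that belongs to a qualifying streak rather than only the start months, the run-level flags can be expanded back to one flag per row with rep(). A small sketch with made-up temperatures (`in_streak` is my own name):

```r
# made-up temperatures; positions 4 to 8 form the only run of 5 or more
temps <- c(0.6, 0.7, 0.2, 0.9, 0.8, 0.5, 0.7, 1.1, 0.1, 0.6)

runs <- rle(temps >= 0.5)
# one flag per run, repeated once for every row in that run
in_streak <- rep(runs$values & runs$lengths >= 5, times = runs$lengths)

which(in_streak)
```

With your data frame, Nino3.4_1974_2000[in_streak, ] would then keep only the months inside such streaks, assuming in_streak is built from the temperature column of that same data frame.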
Here's one way using the fact that the rows are regularly spaced, always one month apart. Then the problem reduces to finding at least 5 consecutive rows with temps >= 0.5 degrees:
# Some sample data
d <- data.frame(Month=1:20, Temp=c(rep(1,6),0,rep(1,4),0,rep(1,5),0, rep(1,2)))
d
# Use rle to find runs of temps >= 0.5 degrees
x <- rle(d$Temp >= 0.5)
# Then find the last row in each run of 5 or more
y <- x$lengths>=5 # BUG HERE: See update below!
lastRow <- cumsum(x$lengths)[y]
# Finally, deduce the first row and make a result matrix
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow)
res
# firstRow lastRow
#[1,] 1 6
#[2,] 13 17
UPDATE I had a bug that detected runs with 5 values less than 0.5 too. Here's the updated code (and test data):
d <- data.frame(Month=1:20, Temp=c(rep(0,6),1,0,rep(1,4),0,rep(1,5),0, 1))
x <- rle(d$Temp >= 0.5)
y <- x$lengths>=5 & x$values
lastRow <- cumsum(x$lengths)[y]
firstRow <- lastRow - x$lengths[y] + 1L
res <- cbind(firstRow, lastRow)
res
# firstRow lastRow
#[1,] 14 18
