I have a data frame in which I need to change all negative values to positive and then multiply the changed values by 100, i.e. multiply every negative value by -100. I MUST use a for loop and if or ifelse.
My data frame: df <- data.frame(x = factor(c("a","b","c","d","e")), y = seq(-4, 4, by = 2), z = c(3,4,-5,6,-8))
x y z
1 a -4 3
2 b -2 4
3 c 0 -5
4 d 2 6
5 e 4 -8
So far I have succeeded in changing two of the negative values, but for some reason the other two didn't change.
Here is the code:
for(i in 2:length(df)){
  value <- df[[i]][i]
  if(value < 0){
    df[[i]][i] = value * -100
  }
}
The result
x y z
1 a -4 3
2 b 200 4
3 c 0 500
4 d 2 6
5 e 4 -8
As you can see, the two negative values at [2,2] and [3,3] have been multiplied by -100 but the other two have not. Can anyone help me understand why this happened?
Thanks!
Take a look at the simplicity of this application of two vectorized functions, abs and *, to the last two columns:
dfrm <- read.table(text="x y z
1 a -4 3
2 b -2 4
3 c 0 -5
4 d 2 6
5 e 4 -8",head=TRUE, colClasses=c( "character", "factor", "numeric", "numeric") )
# I have no idea what your statement "z is c()" means
dfrm[-1] <- abs(dfrm[-1])*100
> dfrm
x y z
1 a 400 300
2 b 200 400
3 c 0 500
4 d 200 600
5 e 400 800
To answer some of your specific questions:
The if statement is not being ignored in:
if(i < 0) {
  df <- df[ ,i]*-100
}
The value of i in the loop index will never be < 0, so the expression i < 0 will always be FALSE and the body of the if never runs. (That conditional really makes no sense.) Note also that in your posted loop, df[[i]][i] picks out column i, row i, so only the "diagonal" cells [2,2] and [3,3] are ever tested, which is why the other two negative values were left untouched.
If you want to make an assignment only in those rows where the value was negative, you could assign with a logical index, for example as sketched below.
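A minimal sketch of that idea, assuming the df defined in the question (loop over the two numeric columns, flag the negative cells with a logical vector, and modify only those cells):
for (j in c("y", "z")) {
  neg <- !is.na(df[[j]]) & df[[j]] < 0   # logical index: TRUE where the value is negative
  df[[j]][neg] <- df[[j]][neg] * -100    # flip the sign and scale by 100 in one step
}
df
#   x   y   z
# 1 a 400   3
# 2 b 200   4
# 3 c   0 500
# 4 d   2   6
# 5 e   4 800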
I've read most of the similar questions here, but I'm still having a hard time understanding how passing additional arguments to the order function breaks ties.
The example introduced in the R documentation shows that :
order(x <- c(1,1,3:1,1:4,3), y <- c(9,9:1), z <- c(2,1:9))
returns
[1] 6 5 2 1 7 4 10 8 3 9
However, what does it mean when y is 'breaking ties' of x, and z is 'breaking ties' of y? The x vector is:
[1] 1 1 3 2 1 1 2 3 4 3
and the y vector is:
[1] 9 9 8 7 6 5 4 3 2 1
Also, if I eliminate z from the first function,
order(x <- c(1,1,3:1,1:4,3), y <- c(9,9:1))
it returns :
[1] 6 5 1 2 7 4 10 8 3 9
so I'm unclear how the numbers in the y vector are relevant to ordering the four 1s, the two 2s, and the three 3s in x. I would very much appreciate the help. Thanks!
Let's take a look at
idx <- order(x <- c(1,1,3:1,1:4,3), y <- c(9,9:1), z <- c(2,1:9))
idx;
#[1] 6 5 2 1 7 4 10 8 3 9
First thing to note is that
x[idx]
# [1] 1 1 1 1 2 2 3 3 3 4
So idx orders entries in x from smallest to largest values.
Values in y and z affect how order treats ties in x.
Take entries x[5] = 1 and x[6] = 1. Since there is a tie here, order looks up entries at the corresponding positions in y, i.e. y[5] = 6 and y[6] = 5. Since y[6] < y[5], the entries in x are sorted x[6] < x[5].
If there is a tie in y as well, order will look up entries in the next vector z. This happens for x[1] = 1 and x[2] = 1, where both y[1] = 9 and y[2] = 9. Here z breaks the tie because z[2] = 1 < z[1] = 2, and therefore x[2] is placed before x[1].
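To see the tie-breaking at a glance, you can lay the three vectors side by side in the order that idx produces (a quick check you can run yourself):
x <- c(1,1,3:1,1:4,3)
y <- c(9,9:1)
z <- c(2,1:9)
idx <- order(x, y, z)
# Rows are arranged by x; within equal x by y; within equal x and y by z.
cbind(x, y, z)[idx, ]
#       x y z
#  [1,] 1 5 5
#  [2,] 1 6 4
#  [3,] 1 9 1
#  [4,] 1 9 2
#  [5,] 2 4 6
#  [6,] 2 7 3
#  [7,] 3 1 9
#  [8,] 3 3 7
#  [9,] 3 8 2
# [10,] 4 2 8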
Say I have a data frame:
> editor
A B C D E F G H I J
User1 1 0 5 6 5 6 5 6 2 6
User2 0 5 4 6 4 5 5 1 7 5
I want to store the column name of the first occurring 2nd-largest value in each of the above rows.
Expected results
> editor
A B C D E F G H I J 2nd_highest
User1 1 0 5 6 5 6 5 6 2 6 C
User2 0 5 4 6 4 5 5 1 7 5 D
I tried edited$2nd_highest <- colnames(edited)[apply(edited, 1, which.max)+1] but it didn't work well.
Any ideas?
Here's an attempt to achieve this using algebra in order to keep it vectorized and avoid by-row operations (though it still does a matrix conversion, similar to apply). The idea is to find each row's maximum, subtract it from the data, multiply by -1 and take the log, which turns the maxima into -Inf (i.e. the smallest possible value), and then take 1/result so that the largest value among those left becomes the biggest entry in its row.
indx <- max.col(1/log((editor - editor[cbind(1:nrow(editor),
max.col(editor))]) * -1), ties.method = "first")
names(editor)[indx]
# [1] "C" "D"
Here is an idea. We first sort the unique values of each row and extract the second value; since we specify decreasing = TRUE, that second value will be the second highest. We then use the first element of each entry of the resulting list as the index into the column names.
ind_lst <- apply(df, 1, function(i) which(i == sort(unique(i), decreasing = TRUE)[2]))
df$highest.two <- names(df)[unlist(lapply(ind_lst, '[', 1))]
df
# A B C D E F G H I J highest.two
#User1 1 0 5 6 5 6 5 6 2 6 C
#User2 0 5 4 6 4 5 5 1 7 5 D
This can help you:
mat <- matrix(sample(1:8, 24, replace=TRUE), ncol=6)
mat
sec_highest <- apply(mat, 1, function(x) which(x == max(x[which(x != max(x))])))
LETTERS[sec_highest] # letters display
Note that if you have two second-highest values with the same score, only one will be displayed.
The calculation of means and medians of vector t, for each value of vector y (from 1 to 4) where x=1 and z=1, using the aggregate function in R, was discussed in the question linked below the data:
x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86
Multiple aggregation in R with 4 parameters
But how can I, for each value (from 1 to 5) of vector x, calculate (mean(y)+mean(z))/(mean(z)-mean(t))? And the calculation must not use values of 0 or NA in any vector. For example, in vector y the 3rd value is 0, so the 3rd number in every vector (y, z, t) should not be used. And in the result the third row (for x=3) should be NA.
Here is the code for calculating the means of y, z and t; the formula for (mean(y)+mean(z))/(mean(z)-mean(t)) still needs to be added:
library(data.table)
data <- data.table(dataframe)
bar <- data[, .N, by = x]
foo <- data[, list(mean.y = mean(y, na.rm = TRUE),
                   mean.z = mean(z, na.rm = TRUE),
                   mean.t = mean(t, na.rm = TRUE)),
            by = x]
In this code all rows are used for calculating the means, but for calculating (mean(y)+mean(z))/(mean(z)-mean(t)), any row where y, z or t is equal to zero or NA should not be used.
Update:
Oh, this can be further simplified, as data.table doesn't subset NA by default (especially with such cases in mind, similar to base::subset). So, you just have to do:
dt[y != 0 & z != 0 & t != 0,
list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
FWIW, here's how I'd do it in data.table:
dt[(y | NA) & (z | NA) & (t | NA),
list(ans=(mean(y)+mean(z))/(mean(z)-mean(t))), by=x]
# x ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278
Let's break it down with the general syntax: dt[i, j, by]:
In i, we filter on your conditions using a nice little hack: TRUE | NA = TRUE, FALSE | NA = NA and NA | NA = NA (you can test these out in your R session; see the short demonstration below).
Since you say you need only the non-zero, non-NA values, it's just a matter of |ing each column with NA, which returns TRUE only where your condition holds. That settles the subset-by-condition part.
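A quick demonstration of the | NA hack (the small vector at the end is just an illustration):
TRUE  | NA          # TRUE
FALSE | NA          # NA
NA    | NA          # NA
c(3, 0, NA) | NA    # TRUE NA NA -- only the non-zero, non-NA entry stays TRUE
Because data.table drops rows where i evaluates to NA, only the rows that come out TRUE survive the subset.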
Then for each group in by, we aggregate according to your function, in j, to get the result.
HTH
Here's one solution:
# create your sample data frame
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)
library('dplyr')
dfmeans <- df %>%
  filter(!is.na(y) & !is.na(z) & !is.na(t)) %>%   # remove rows with NAs
  filter(y != 0 & z != 0 & t != 0) %>%            # remove rows with zeroes
  group_by(x) %>%
  summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))
I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
# x xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278
And if you just want the values from xmeans, use dfmeans$xmeans.
I'm a beginner in R. Please help me with coding the function below. Thanks!
Create a function named CountNonpositives that takes a numeric dataframe as its only input parameter. This function should return a dataframe with one row for each column of the input dataframe. This output dataframe should have two columns, one giving the name of each column of the input dataframe and the other giving the number of observations of that variable which are not positive.
Note: missing values, if any, must be included in the nonpositive count.
sapply does the trick for you here. I trust you can encapsulate it into a function that matches your specifics.
d <- data.frame(
x = c(sample(-10:10, 10, replace = TRUE),NA),
y = c(sample(-10:10, 10, replace = TRUE),NA),
z = c(sample(-10:10, 10, replace = TRUE),NA)
)
sapply(d, function(x) sum(x<0 & !is.na(x)) )
Preview -
> d
x y z
1 5 10 2
2 9 -2 -2
3 -9 10 -2
4 -1 0 0
5 2 -9 7
6 -5 7 -3
7 1 -7 10
8 -10 5 -8
9 8 6 -9
10 -8 10 -4
11 NA NA NA
> sapply(d, function(x) sum(x<0 & !is.na(x)) )
x y z
5 3 6
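Building on that, here is a minimal sketch of the requested CountNonpositives function under the task's definition (values <= 0, with NAs counted as nonpositive); the output column names are only illustrative:
CountNonpositives <- function(df) {
  # x <= 0 is NA for missing values, so OR it with is.na(x) to count those as well
  counts <- sapply(df, function(x) sum(x <= 0 | is.na(x)))
  data.frame(variable = names(df), nonpositives = counts, row.names = NULL)
}
CountNonpositives(d)
# With the d previewed above this gives:
#   variable nonpositives
# 1        x            6
# 2        y            5
# 3        z            8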
I am trying to loop over a data matrix for each separate ID tag, "1", "2" and "3" (see my data at the bottom). Ultimately I am doing this to transform the X and Y coordinates into a time series with the ts() function, but first I need to build a loop into the function that returns a time series for each separate ID. The looping itself works perfectly fine when I use the following code on a data frame:
for(i in 1:3){
  print(na.omit(xyframe[ID==i,]))
}
Returning the following output:
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
Timestamp X Y ID
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
Timestamp X Y ID
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
Yet, when I want to produce a loop in a matrix with the same code:
for(i in 1:3){
  print(na.omit(xymatrix[ID==i,]))
}
It returns the following error:
Error in print(na.omit(xymatrix[ID == i, ])) :
  (subscript) logical subscript too long
Why does it not work to loop over the IDs in a matrix while it does work for the data frame, and how can I fix it?
Furthermore, I read that looping requires much more computational power than doing the same thing vector-based; would there be a way to do this vector-based?
The data (simplification of the real data):
Timestamp X Y ID
1. 0 -34.012 3.406 1
2. 100 -33.995 3.415 1
3. 200 -33.994 3.427 1
4. 0 -34.093 3.476 2
5. 100 -34.145 3.492 2
6. 200 -34.195 3.506 2
7. 0 -34.289 3.522 3
8. 100 -34.300 3.520 3
9. 200 -34.303 3.517 3
The format xymatrix[ID==i,] doesn't work for a matrix. Try it this way:
for(i in 1:3){ print(na.omit(xymatrix[xymatrix[,'ID'] == i,])) }
In general, if you want to apply a function to a data frame, split by some factor, then you should be using one of the apply family of functions in combination with split.
Here's some reproducible sample data.
n <- 20
some_data <- data.frame(
  x = sample(c(1:5, NA), n, replace = TRUE),
  y = sample(c(letters[1:5], NA), n, replace = TRUE),
  grp = gl(3, 1, length = n)
)
If you want to print out the rows with no missing values, split by each ID level, then you want something like this.
lapply(split(some_data, some_data$grp), na.omit)
or more concisely using the plyr package.
library(plyr)
dlply(some_data, .(grp), na.omit)
Both methods return output like this
# $`1`
# x y grp
# 1 2 d 1
# 4 3 e 1
# 7 3 c 1
# 10 4 a 1
# 13 2 e 1
# 16 3 a 1
# 19 1 d 1
# $`2`
# x y grp
# 2 1 e 2
# 5 3 e 2
# 8 3 b 2
# $`3`
# x y grp
# 6 3 c 3
# 9 5 a 3
# 12 2 c 3
# 15 2 d 3
# 18 4 a 3