Finding sum in all rows with R

I have two vectors
x <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,5,5,6,6,6,6)
y <- c(1,1,2,3,4,2,2,4,4,4,3,3,1,4,2,3,1,4,4,4,2,2,2,3,3)
I found the number of values for each x (from 1 to 6) as
t=table(x,y)
and get a table with 6 rows and 4 columns. Then I calculate the sum of each row as s=apply(t,1,sum) and get an error. Could anybody explain what I am doing wrong?

What is the error? I don't get one with apply(t, 1, sum). Try instead
rowSums(t)
##1 2 3 4 5 6
##5 5 4 5 2 4
Or, you could simply use table(x), which gives you exactly the same output.
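A minimal sketch comparing the three approaches on the vectors from the question. The name tab is an illustrative choice (the question calls the table t, which is also the name of the base transpose function). Since no error message was posted, this only shows that the three calls agree on this data.

```r
x <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,5,5,6,6,6,6)
y <- c(1,1,2,3,4,2,2,4,4,4,3,3,1,4,2,3,1,4,4,4,2,2,2,3,3)

# 'tab' is an illustrative name, chosen to avoid reusing the name of base::t()
tab <- table(x, y)

s_apply   <- apply(tab, 1, sum)  # row totals via apply
s_rowsums <- rowSums(tab)        # same totals, computed by rowSums
s_table   <- table(x)            # counts of x alone, ignoring y

print(s_rowsums)
```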

Related

R: Assign Rank 1 to Predefined Largest Value

I have a dataset like this:
Value
5
4
2
1
I want the largest value to have the smallest rank and the lowest value to have the highest rank.
In this dataset, Value=1 should be recoded to 5 and Value=5 to 1.
However, because Value=3 is missing from my dataset, the rank function rank(-Value) only gives me this:
Value Rank
5 1
4 2
2 3
1 4
Is there any way in R to get something like this?
Value Rank
5 1
4 2
2 4
1 5
Try it like this:
df <- data.frame(Value = c(5, 4, 2, 1))
df$fact <- as.factor(df$Value)
df$Rank <- as.numeric(rev(levels(df$fact)))[df$fact]
> (df <- df[, -2])
Value Rank
1 5 1
2 4 2
3 2 4
4 1 5
You can do this by finding the max and min values of your vector and then matching each value against the complete descending sequence from the max down to the min.
v <- c(5,4,2)
x <- min(v)
y <- max(v)
y:x
match(v,y:x)
[1] 1 2 4
Using the levels of a factor as J.Win. suggests will work as long as there is a 1 in your vector but otherwise, the highest value will not have a rank of 1. Sorry, I do not have enough reputation to add this as a comment.
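The difference between the two answers can be checked on a vector that contains no 1. This is a sketch only: rank_arith is a shortcut of my own, not taken from either answer, and it assumes integer-valued data.

```r
v <- c(5, 4, 2)

# Position of each value in the descending sequence max(v):min(v)
rank_match <- match(v, max(v):min(v))   # 1 2 4

# Same result by arithmetic (assumption: values are whole numbers)
rank_arith <- max(v) - v + 1            # 1 2 4

# Factor-level approach, for comparison: the largest value gets 2, not 1
f <- factor(v)
rank_factor <- as.numeric(rev(levels(f)))[f]   # 2 4 5
```

With the original data c(5, 4, 2, 1), all three give 1 2 4 5, which is why the factor version looks correct there.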

replace values in one dataset with values in another dataset R

I have a somewhat seemingly simple problem that I am stumped with. I have a df, say:
x y z
0 1 2
3 5 4
1 0 5
0 5 0
and another:
x y z
1 5 6
2 4 5
4 5 7
5 8 5
I want to replace the zero values in df1 with the corresponding values in df2. E.g., cell 1 of df1 would be 1 instead of zero. I want this for all columns in the data frame. Can you help me code this? I can't seem to figure it out. Thanks!
First, you can locate the indices of 0's using which
zero_locations <- which(df1 == 0, arr.ind=TRUE)
Then, you can use the locations to make the replacements:
df1[zero_locations] <- df2[zero_locations]
As David Arenburg pointed out in the comments, which isn't strictly necessary; a logical matrix works as well:
zero_locations <- df1 == 0
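A self-contained sketch of the replacement using the two data frames from the question and the logical-matrix form of the index.

```r
df1 <- data.frame(x = c(0, 3, 1, 0), y = c(1, 5, 0, 5), z = c(2, 4, 5, 0))
df2 <- data.frame(x = c(1, 2, 4, 5), y = c(5, 4, 5, 8), z = c(6, 5, 7, 5))

# Logical matrix with the same shape as df1: TRUE where a cell is 0
zero_locations <- df1 == 0

# Copy the values at those positions over from df2
df1[zero_locations] <- df2[zero_locations]

print(df1)
#   x y z
# 1 1 1 2
# 2 3 5 4
# 3 1 5 5
# 4 5 5 5
```

This relies on both data frames having the same dimensions and column order.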

Arguments for Subset within a function in R colon v. greater or equal to

Suppose I have the following data.
x<- c(1,2, 3,4,5,1,3,8,2)
y<- c(4,2, 5,6,7,6,7,8,9)
data<-cbind(x,y)
x y
1 1 4
2 2 2
3 3 5
4 4 6
5 5 7
6 1 6
7 3 7
8 8 8
9 2 9
Now, if I subset this data to select only the observations with "x" between 1 and 3 I can do:
s1<- subset(data, x>=1 & x<=3)
and obtain my desired output:
x y
1 1 4
2 2 2
3 3 5
4 1 6
5 3 7
6 2 9
However, if I subset using the colon operator I obtain a different result:
s2<- subset(data, x==1:3)
x y
1 1 4
2 2 2
3 3 5
This time it only includes the first observation in which "x" was 1,2, or 3. Why?
I would like to use the ":" operator because I am writing a function so the user would input a range of values from which she wants to see an average calculated over the "y" variable. I would prefer if they can use ":" operator to pass this argument to the subset function inside my function but I don't know why subsetting with ":" gives me different results.
I'd appreciate any suggestions in this regard.
You can use %in% instead of ==
subset(data, x %in% 1:3)
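The comparison x == 1:3 recycles 1:3 along x, so it tests x[1] == 1, x[2] == 2, x[3] == 3, x[4] == 1, and so on; only the first three rows happen to match. In general, if we are comparing two vectors of unequal sizes, %in% should be used. There are cases where we can take advantage of recycling (it can fail too) when the length of one vector is a multiple of the length of the other. Some examples with descriptions are here.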
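The two comparisons can be inspected directly. In this sketch the data is stored as a data frame rather than the cbind() matrix from the question, and mean_y_in is a hypothetical helper showing how a user-supplied range such as 1:3 could be passed into a function.

```r
x <- c(1, 2, 3, 4, 5, 1, 3, 8, 2)
y <- c(4, 2, 5, 6, 7, 6, 7, 8, 9)
data <- data.frame(x, y)  # assumption: a data frame instead of cbind(x, y)

eq_mask <- x == 1:3    # 1:3 is recycled to 1,2,3,1,2,3,1,2,3
in_mask <- x %in% 1:3  # TRUE wherever x is any of 1, 2, 3

sum(eq_mask)  # 3 rows selected
sum(in_mask)  # 6 rows selected

# Hypothetical helper: mean of y over rows whose x is in 'range'
mean_y_in <- function(df, range) mean(df$y[df$x %in% range])
mean_y_in(data, 1:3)  # 5.5
```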

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R.
I also need to know whether the last group of rows in the dataset has fewer than x observations.
For example, suppose I use a dataset with 10 observations and 2 variables and want to group every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works only if the number of rows is an exact multiple of 3. Every three rows get tagged with consecutive numbers.
A quick and dirty way to find out how incomplete the final group is: check the remainder when nrow is divided by the group size: nrow(df) %% 3 #change the divisor to your group size
Assuming your data is df, you can do
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
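An alternative sketch that avoids building and truncating a longer vector: divide the row index by the group size and round up. The names group_size, last_group_n and last_is_short are illustrative, and the data frame reproduces the ten rows shown in the question.

```r
df <- data.frame(
  speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11),
  dist  = c(2, 10, 4, 22, 16, 10, 18, 26, 34, 17)
)
group_size <- 3  # the "x" from the question

# Rows 1-3 get 1, rows 4-6 get 2, and so on
df$newcol <- ceiling(seq_len(nrow(df)) / group_size)

# Size of the final group, and whether it is incomplete
last_group_n  <- sum(df$newcol == max(df$newcol))
last_is_short <- last_group_n < group_size

print(df)
```

Here last_group_n equals nrow(df) %% 3 whenever that remainder is non-zero; when the remainder is 0 the last group is complete.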

Computing difference between rows in a data frame

I have a data frame. I would like to compute how "far" each row is from a given row. Let us consider it for the 1st row. Let the data frame be as follows:
> sampleDF
X1 X2 X3
1 5 5
4 2 2
2 9 1
7 7 3
What I wish to do is the following:
Compute the difference between the 1st row & others: sampleDF[1,]-sampleDF[2,]
Consider only the absolute value: abs(sampleDF[1,]-sampleDF[2,])
Compute the sum of the newly formed data frame of differences: rowSums(newDF)
Now to do this for the whole data frame.
newDF <- sapply(2:4,function(x) { return (abs(sampleDF[1,]-sampleDF[x,]));})
This creates a problem in that the result is a transposed list. Hence,
newDF <- as.data.frame(t(sapply(2:4,function(x) { return (abs(sampleDF[1,]-sampleDF[x,]));})))
But another problem arises while computing rowSums:
> class(newDF)
[1] "data.frame"
> rowSums(newDF)
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be numeric
> newDF
X1 X2 X3
1 3 3 3
2 1 4 4
3 6 2 2
>
Puzzle 1: Why do I get this error? I did notice that newDF[1,1] is a list & not a number. Is it because of that? How can I ensure that the result of the sapply & transpose is a simple data frame of numbers?
So I proceed to create a global data frame & modify it within the function:
sapply(2:4,function(x) { newDF <<- as.data.frame(rbind(newDF,abs(sampleDF[1,]-sampleDF[x,])));})
> newDF
X1 X2 X3
2 3 3 3
3 1 4 4
4 6 2 2
> rowSums(newDF)
2 3 4
9 9 10
>
This is as expected.
Puzzle 2: Is there a cleaner way to achieve this? How can I do this for every row in the data frame (shown above is only for "distance" from row 1. Would need to do this for other rows as well)? Is running a loop the only option?
To put it in words, you are trying to compute the Manhattan distance:
dist(sampleDF, method = "manhattan")
#    1  2  3
# 2  9
# 3  9 10
# 4 10  9  9
Regarding your implementation, I think the problem is that your inner function is returning a data.frame when it should return a numeric vector. Doing return(unlist(abs(sampleDF[1,]-sampleDF[x,]))) should fix it.
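Both suggestions can be tried on the data from the question. This is a sketch: d is an illustrative name, and as.matrix() is used only to make the distances from row 1 easy to extract.

```r
sampleDF <- data.frame(X1 = c(1, 4, 2, 7),
                       X2 = c(5, 2, 9, 7),
                       X3 = c(5, 2, 1, 3))

# All pairwise Manhattan distances as a full symmetric matrix
d <- as.matrix(dist(sampleDF, method = "manhattan"))
d[1, -1]  # distances from row 1 to rows 2, 3, 4: 9 9 10

# The sapply version, returning a numeric vector from each call
newDF <- as.data.frame(t(sapply(2:4, function(i) {
  unlist(abs(sampleDF[1, ] - sampleDF[i, ]))
})))
rowSums(newDF)  # 9 9 10
```

Row i of d holds the distances from row i to every other row, so no explicit loop over rows is needed.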
