Replace dataframe missing values with linear trend - r

I have an imported data frame where some values are missing in x2. Here is a simplified example.
I'd like to replace the missing values with a linear trend between the last and next available values.
Any suggestions on how to do it?
a <- data.frame(x1=1:11, x2=c(6,"","","","",12,"","",4,"",20))
a
x1 x2
1 1 6
2 2
3 3
4 4
5 5
6 6 12
7 7
8 8
9 9 4
10 10
11 11 20

You can try approx() as shown below:
transform(
a,
x2 = approx(x1[nzchar(x2)], na.omit(as.numeric(x2)), x1)$y
)
which gives
x1 x2
1 1 6.000000
2 2 7.200000
3 3 8.400000
4 4 9.600000
5 5 10.800000
6 6 12.000000
7 7 9.333333
8 8 6.666667
9 9 4.000000
10 10 12.000000
11 11 20.000000
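
A more explicit variant of the same idea, as a minimal sketch: it first turns the empty strings into NA, then interpolates only between the known points. The names x2_num, known and filled are made up for illustration, and stringsAsFactors = FALSE is an assumption so that x2 is read as character on older R versions too.

```r
# Same example data as in the question; x2 is character because of the "" entries
a <- data.frame(x1 = 1:11,
                x2 = c(6, "", "", "", "", 12, "", "", 4, "", 20),
                stringsAsFactors = FALSE)

# Turn empty strings into NA before converting to numeric
x2_num <- a$x2
x2_num[x2_num == ""] <- NA
x2_num <- as.numeric(x2_num)

# Interpolate linearly between the known (x1, x2) pairs at every x1
known <- !is.na(x2_num)
filled <- a
filled$x2 <- approx(x = a$x1[known], y = x2_num[known], xout = a$x1)$y
filled
```

With the default rule = 1, approx() returns NA for positions before the first or after the last known value; rule = 2 would carry the nearest known value outwards instead.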

Related

how to get row average for certain columns in r data frame?

I have data that looks like this
t=c(3,2,9,8)
u=c(5,6,7,8)
v=c(3,2,1,9)
w=c(5,6,7,8)
x=c(1,2,3,4)
y=c(4,3,2,1)
z=data.frame(t,u,v,w,x,y)
output:
t u v w x y
1 3 5 3 5 1 4
2 2 6 2 6 2 3
3 9 7 1 7 3 2
4 8 8 9 8 4 1
I would like to get the mean of each row for the first three columns, and then get the mean of each row for the last three columns. Ex. mean of row 1, columns t-v and mean of row 1, columns w-y, and so on.
Desired output:
t u v avg w x y avg2
1 3 5 3 3.6 5 1 4 3
2 2 6 2 3.3 6 2 3 3.6
3 9 7 1 5.6 7 3 2 4
4 8 8 9 8.3 8 4 1 4.3
How can I go about doing this?
Use rowMeans(). Using column names:
z$avg <- rowMeans(z[c("t", "u", "v")])
z$avg2 <- rowMeans(z[c("w", "x", "y")])
Result:
t u v w x y avg avg2
1 3 5 3 5 1 4 3.666667 3.333333
2 2 6 2 6 2 3 3.333333 3.666667
3 9 7 1 7 3 2 5.666667 4.000000
4 8 8 9 8 4 1 8.333333 4.333333
Alternative using column indices, with re-arranged output:
z$avg <- rowMeans(z[1:3])
z$avg2 <- rowMeans(z[4:6])
z <- z[c(1:3, 7, 4:6, 8)]
Result:
t u v avg w x y avg2
1 3 5 3 3.666667 5 1 4 3.333333
2 2 6 2 3.333333 6 2 3 3.666667
3 9 7 1 5.666667 7 3 2 4.000000
4 8 8 9 8.333333 8 4 1 4.333333
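One more alternative, using dplyr from the tidyverse, is rowwise() with c_across(), selecting each group of columns separately:
library(dplyr)
z <- z %>%
rowwise() %>%
mutate(avg = mean(c_across(t:v)), avg2 = mean(c_across(w:y))) %>%
ungroup()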

Plot in one graph 2 sets of 3 variables with reversed y axis

I have 2 tables, each one with 3 variables and a monthly value:
table1
x1 x2 x3
6 10 4 #jan
8 12 3 #feb
2 13 5 #mar
1 17 2 #apr
9 10 7 #may
5 15 1 #jun
6 19 3 #jul
3 13 8 #aug
6 18 2 #sep
8 11 4 #oct
2 15 6 #nov
1 17 2 #dec
table2
x1 x2 x3
3 11 1 #jan
5 15 6 #feb
7 11 2 #mar
3 16 4 #apr
7 12 5 #may
4 13 3 #jun
3 17 5 #jul
4 15 9 #aug
7 16 4 #sep
9 13 5 #oct
4 17 6 #nov
2 14 4 #dec
Now I want to compare the 2 datasets in one graph in this style:
As you can see, the first two variables should use the standard direction of the y axis (increasing upwards), but the third variable should use a reversed y axis on the right side, so that it appears in the upper part of the graph.
How can I achieve this with both sets in the plot for comparison?
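
One possible approach with base graphics, offered as a rough sketch rather than a definitive answer: draw x1 and x2 against the left axis, then overlay a second plot for x3 whose ylim runs from high to low and attach it to a right-hand axis. The function name plot_both, the colours, the line types and the doubled axis limits are all assumptions chosen for illustration; the intended style from the question is not known.

```r
table1 <- data.frame(
  x1 = c(6, 8, 2, 1, 9, 5, 6, 3, 6, 8, 2, 1),
  x2 = c(10, 12, 13, 17, 10, 15, 19, 13, 18, 11, 15, 17),
  x3 = c(4, 3, 5, 2, 7, 1, 3, 8, 2, 4, 6, 2)
)
table2 <- data.frame(
  x1 = c(3, 5, 7, 3, 7, 4, 3, 4, 7, 9, 4, 2),
  x2 = c(11, 15, 11, 16, 12, 13, 17, 15, 16, 13, 17, 14),
  x3 = c(1, 6, 2, 4, 5, 3, 5, 9, 4, 5, 6, 4)
)
months <- 1:12

plot_both <- function() {
  op <- par(mar = c(5, 4, 2, 4) + 0.1)  # extra room on the right for the second axis
  on.exit(par(op))

  left  <- cbind(table1$x1, table1$x2, table2$x1, table2$x2)
  right <- cbind(table1$x3, table2$x3)

  # x1 and x2 on the normal y axis; doubling the upper limit keeps the top half free
  matplot(months, left, type = "l",
          col = c("blue", "red", "blue", "red"), lty = c(1, 1, 2, 2),
          ylim = c(0, 2 * max(left)), xaxt = "n",
          xlab = "month", ylab = "x1 (blue), x2 (red)")
  axis(1, at = months, labels = month.abb)

  # x3 on an overlaid plot whose ylim is reversed, so it hangs from the top
  par(new = TRUE)
  matplot(months, right, type = "l", col = "darkgreen", lty = c(1, 2),
          ylim = c(2 * max(right), 0), axes = FALSE, xlab = "", ylab = "")
  axis(4)
  mtext("x3 (green, reversed)", side = 4, line = 2.5)
  legend("left", legend = c("table1", "table2"), lty = c(1, 2), bty = "n")
}
```

Calling plot_both() draws the figure on the current graphics device. Solid lines are table1 and dashed lines are table2, so the two sets can be compared directly.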

Average all other columns based on one column in matrix [duplicate]

I need to average a large number of columns based on the name in another column. My matrix looks like this (with separate unique row names):
Names X1 Y1 Z1 X2 Y2 Z2
P.maccus 4 2 2 6 5 3
P.maccus 6 5 3 7 6 5
P.maccus 8 3 2 8 7 3
A.ammophius 3 6 2 7 5 5
P.sabaji 2 5 3 8 4 5
P.sabaji 4 6 3 9 6 5
P.sabaji 5 7 2 8 7 3
P.sabaji 3 5 3 9 5 4
I need to average the rows for each name so that the result looks like this:
Names X1 Y1 Z1 X2 Y2 Z2
P.maccus 6 3.33 2.33 7 6 3.66
A.ammophius 3 6 2 7 5 5
P.sabaji 3.5 5.75 2.75 8.5 5.5 4.25
Can anyone help? Thank you!
This is pretty easy with dplyr. You can do
library(dplyr)
dd %>% group_by(Names) %>% summarize_all(mean)
Tested with the following data:
dd<-read.table(text="Names X1 Y1 Z1 X2 Y2 Z2
P.maccus 4 2 2 6 5 3
P.maccus 6 5 3 7 6 5
P.maccus 8 3 2 8 7 3
A.ammophius 3 6 2 7 5 5
P.sabaji 2 5 3 8 4 5
P.sabaji 4 6 3 9 6 5
P.sabaji 5 7 2 8 7 3
P.sabaji 3 5 3 9 5 4", header=TRUE)
You can use aggregate() for that.
Assuming your data is in a data frame named df:
aggregate(. ~ Names, data=df, FUN=mean)
Names X1 Y1 Z1 X2 Y2 Z2
1 A.ammophius 3.0 6.000000 2.000000 7.0 5.0 5.000000
2 P.maccus 6.0 3.333333 2.333333 7.0 6.0 3.666667
3 P.sabaji 3.5 5.750000 2.750000 8.5 5.5 4.250000
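
If the data really is a numeric matrix, a base R sketch with rowsum() may also work. This assumes the species names live in a separate character vector (called species here, a made-up name), since a numeric matrix cannot hold them in a column. The group sums are divided by the group sizes to get the means.

```r
# Numeric matrix with the values from the question
m <- matrix(
  c(4, 2, 2, 6, 5, 3,
    6, 5, 3, 7, 6, 5,
    8, 3, 2, 8, 7, 3,
    3, 6, 2, 7, 5, 5,
    2, 5, 3, 8, 4, 5,
    4, 6, 3, 9, 6, 5,
    5, 7, 2, 8, 7, 3,
    3, 5, 3, 9, 5, 4),
  ncol = 6, byrow = TRUE,
  dimnames = list(NULL, c("X1", "Y1", "Z1", "X2", "Y2", "Z2"))
)
# Assumed: one group label per row, kept outside the matrix
species <- c(rep("P.maccus", 3), "A.ammophius", rep("P.sabaji", 4))

sums   <- rowsum(m, group = species)       # column sums within each group
counts <- table(species)[rownames(sums)]   # group sizes, aligned to the rows of sums
means  <- sums / as.vector(counts)         # row i is divided by the size of group i
means
```

The result is again a matrix, with one row per species and the species names as row names.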

Picking a variable in R

I have two variables: X and state which are given below
set.seed(3)
state <- rbinom(15,4,0.6)
X <- c(1:15)
X
state
and the output is
> state
[1] 3 2 3 3 2 2 4 3 2 2 2 2 2 2 1
> X
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
I want to select the Xs corresponding to the same state. Any idea how to do this in R?
Using split you get a list of 4 states
ll <- split(X,state)
$`1`
[1] 15
$`2`
[1] 2 5 6 9 10 11 12 13 14
$`3`
[1] 1 3 4 8
$`4`
[1] 7
ll[3]
$`3`
[1] 1 3 4 8
Generally, we use ave() to perform an operation by group.
For example here I get the mean of X by state:
ave(X,state,FUN = mean)
[1] 4.000000 9.111111 4.000000 4.000000 9.111111 9.111111 7.000000 4.000000 9.111111 9.111111 9.111111 9.111111 9.111111 9.111111 15.000000
Another way could be to put your variables in a data frame and then select them from there:
> df <- data.frame(x = X, state = state)
> df
x state
1 1 3
2 2 2
3 3 3
4 4 3
5 5 2
6 6 2
7 7 4
8 8 3
9 9 2
10 10 2
11 11 2
12 12 2
13 13 2
14 14 2
15 15 1
> df[df$state == 3,]
x state
1 1 3
3 3 3
4 4 3
8 8 3
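
For a single state, plain logical indexing on the vectors may already be enough. A minimal sketch, with state typed in from the output shown in the question rather than regenerated with rbinom(); the names x_state3 and x_state1or4 are made up for illustration.

```r
state <- c(3, 2, 3, 3, 2, 2, 4, 3, 2, 2, 2, 2, 2, 2, 1)
X <- 1:15

# Keep the elements of X whose state equals 3
x_state3 <- X[state == 3]        # 1 3 4 8

# Several states at once with %in%
x_state1or4 <- X[state %in% c(1, 4)]   # 7 15
```

This gives the same values as ll[["3"]] from the split() approach, without building the full list first.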

Sort a matrix based on the number of repeating elements in a particular column in R

I have a matrix with only column labels, and I want to sort it by column A so that repeated elements are ranked before non-repeated ones. For example, because 7 appears four times in column A, its rows are moved in front of the rows with 2 in column A. I hope this makes sense.
A B C
1 11 14
2 2 2
2 5 12
2 13 2
3 16 19
3 10 0
4 20 17
5 5 16
7 14 18
7 8 10
7 10 17
7 7 0
Now, what I want it to look like is the following.
A B C
7 14 18
7 8 10
7 10 17
7 7 0
2 2 2
2 5 12
2 13 2
3 16 19
3 10 0
1 11 14
4 20 17
5 5 16
Thank you so much for your assistance.
Your question needs to be a lot clearer. How are the values in B and C determined? From the description, it sounds like these should just be the corresponding values to those in the A column of the original data, but that's not the case in your example.
Until you clarify further, here's a way in base R that sorts the rows by A according to your condition.
d <- as.matrix(read.table(text="A B C
1 11 14
2 2 2
2 5 12
2 13 2
3 16 19
3 10 0
4 20 17
5 5 16
7 14 18
7 8 10
7 10 17
7 7 0", header=TRUE))
counts <- table(d[,'A'])
ranks <- rank(interaction(counts, names(counts), lex.order=TRUE))
d[order(ranks[match(d[,'A'], names(counts))], decreasing=TRUE), ]
# A B C
# [1,] 7 14 18
# [2,] 7 8 10
# [3,] 7 10 17
# [4,] 7 7 0
# [5,] 2 2 2
# [6,] 2 5 12
# [7,] 2 13 2
# [8,] 3 16 19
# [9,] 3 10 0
# [10,] 5 5 16
# [11,] 4 20 17
# [12,] 1 11 14
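Another option uses plyr. This assumes the data is in a data frame named df that is already sorted by A, because merge() returns its rows sorted by the key:
library(plyr)
counts <- count(df, 'A')
df[order(merge(df, counts)$freq, decreasing=TRUE), ]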
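
A shorter base R sketch of the same idea uses ave() to attach to every row the number of times its A value occurs, then orders by that count (descending) and by A (ascending) to break ties. The names freq and sorted are made up for illustration.

```r
d <- as.matrix(read.table(text = "A B C
1 11 14
2 2 2
2 5 12
2 13 2
3 16 19
3 10 0
4 20 17
5 5 16
7 14 18
7 8 10
7 10 17
7 7 0", header = TRUE))

# For each row, how often does its A value occur in column A?
freq <- ave(d[, "A"], d[, "A"], FUN = length)

# Most frequent A first; equal counts are ordered by A, and rows
# with the same A keep their original order
sorted <- d[order(-freq, d[, "A"]), ]
sorted
```

With this tie-breaking rule the rows come out in the order 7, 2, 3, 1, 4, 5 for column A, which matches the desired output in the question.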

Resources