Naming variables according to rows in R - r

I have two data tables. Data table 1 has two variables and 561 observations, while data table 2 has 563 variables and 10,000 observations. I'm trying to figure out how I can use the values of the code_name variable from data table 1 to rename the variables in data table 2.
What I have:
Data table 1
code code_name
11 rasf
04 iadf
27 pqwr
09 pklf
86 irmw
30 pwql
Data table 2
activity subject V1 V2 V3 V4 V5 V6
5 2 0.29 0.19 5.3 1.8 8.3 0.3
9 7 0.11 0.10 7.8 2.0 0.5 0.9
9 7 0.19 1.10 8.0 1.9 0.4 0.7
What I need:
activity subject rasf iadf pqwr pklf irmw pwql
5 2 0.29 0.19 5.3 1.8 8.3 0.3
9 7 0.11 0.10 7.8 2.0 0.5 0.9
9 7 0.19 1.10 8.0 1.9 0.4 0.7
What I did:
#Extracts all rows and just column two from the data table 1
new_data_table1 <- data_table1[,2]
#Set names on data table 2 to build the final data
final_data <- setnames(data_table2, names(data_table2), c("activity", "subject", new_data_table1))
The problem with my code is that when I extract all rows from data table 1 it gives a long list, showing vectors for the structure and labels of the data. Because of that, when I run my code I get this table:
activity subject 243 244 245 246 247 248
5 2 0.29 0.19 5.3 1.8 8.3 0.3
9 7 0.11 0.10 7.8 2.0 0.5 0.9
9 7 0.19 1.10 8.0 1.9 0.4 0.7
The new names for the variables are numbers because they are the structures and not the labels.

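We can use the names function to rename the variables according to the rows of the first table. Wrapping code_name in as.character guards against it being stored as a factor, in which case its integer codes would be used instead of its labels.
names(df1)[3:length(df1)] <- as.character(df$code_name)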
df1
activity subject rasf iadf pqwr pklf irmw pwql
1 5 2 0.29 0.19 5.3 1.8 8.3 0.3
2 9 7 0.11 0.10 7.8 2.0 0.5 0.9
3 9 7 0.19 1.10 8.0 1.9 0.4 0.7
data
df
code code_name
1 11 rasf
2 4 iadf
3 27 pqwr
4 9 pklf
5 86 irmw
6 30 pwql
df1
activity subject V1 V2 V3 V4 V5 V6
1 5 2 0.29 0.19 5.3 1.8 8.3 0.3
2 9 7 0.11 0.10 7.8 2.0 0.5 0.9
3 9 7 0.19 1.10 8.0 1.9 0.4 0.7

We can use grep to find the indices of the column names in the second dataset that start with "V" followed by digits, and replace them with the values in the second column of the first dataset (here df1 is the first dataset and df2 the second).
names(df2)[grep("^V\\d+", names(df2))] <- as.character(df1[[2]])
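The numeric names in the question are what one would expect if code_name is stored as a factor: combining a factor with character strings via c() uses its integer codes rather than its labels. Below is a minimal base R sketch of that effect and of the as.character fix. The small tables and the name new_names are made up for illustration, and whether the original column really is a factor is an assumption.

```r
# Hypothetical miniature versions of the two tables from the question
df <- data.frame(code = c("11", "04", "27"),
                 code_name = factor(c("rasf", "iadf", "pqwr")))
df1 <- data.frame(activity = c(5, 9), subject = c(2, 7),
                  V1 = c(0.29, 0.11), V2 = c(0.19, 0.10), V3 = c(5.3, 7.8))

# A factor stores integer codes (levels are sorted alphabetically)
codes <- as.integer(df$code_name)

# Converting to character recovers the labels in row order
new_names <- as.character(df$code_name)

# Check the lengths match before renaming everything after the first two columns
stopifnot(length(new_names) == ncol(df1) - 2)
names(df1)[-(1:2)] <- new_names
```

With data.table, the same character vector could be passed to setnames as the new names; setnames modifies the table by reference, so assigning its result to a new object is not needed.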

Related

Time series in R: how to change y-axis?

New R user here, working with meteorological data (the data frame is called "Stations"). I'm trying to plot 3 time series with temperature on the y-axis and a regression line on each one, but I've run into a few problems and there are no error messages.
Loop doesn't seem to be working and I can't figure out why.
Didn't manage to change x-axis graduation values for years ("Année" in the data frame) instead of a number.
Title is the same for the 3 plots, how do I change it so each plot has its own title?
Regression line is not shown on the graph.
Thanks in advance!
Here is my code :
for (i in c(6,8,10))
plot(ts(Stations[,i]), col="dodgerblue4", xlab="Temps", ylab="Température", main="Genève")
for (i in c(6,8,10))
abline(h=Stations[,i])
Nb.enr time Année Mois Jour T2m_GE pcp_GE T2m_PU pcp_PU T2m_NY
1 19810101 1981 1 1 1.3 0.3 2.8 0.0 2.3
2 19810102 1981 1 2 1.2 0.1 2.3 1.2 1.6
3 19810103 1981 1 3 4.1 21.8 4.9 5.2 3.8
4 19810104 1981 1 4 5.1 10.3 5.1 17.4 4.9
5 19810105 1981 1 5 0.9 0.0 1.0 0.1 0.8
6 19810106 1981 1 6 0.5 5.7 0.7 6.0 0.5
7 19810107 1981 1 7 -2.7 0.0 -2.1 0.1 -1.9
8 19810108 1981 1 8 -3.2 0.0 -4.1 0.0 -3.8
9 19810109 1981 1 9 -5.2 0.0 -3.5 0.0 -5.1
10 19810110 1981 1 10 -3.1 10.6 -0.9 6.0 -2.6
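One possible approach, sketched on the ten sample rows above. It keeps the plot and the regression line in the same braced loop body, builds real dates from the time column for the x-axis, and sets a per-plot title. Note that abline(h = ...) draws horizontal lines at the given y-values; a fitted line comes from passing an lm object to abline. The names dates, temp_cols and slopes are introduced here for illustration, and the column names are used as titles because only the Genève station name is given in the question.

```r
# Sample rows from the question (temperature columns only)
Stations <- data.frame(
  time   = 19810101:19810110,
  T2m_GE = c(1.3, 1.2, 4.1, 5.1, 0.9, 0.5, -2.7, -3.2, -5.2, -3.1),
  T2m_PU = c(2.8, 2.3, 4.9, 5.1, 1.0, 0.7, -2.1, -4.1, -3.5, -0.9),
  T2m_NY = c(2.3, 1.6, 3.8, 4.9, 0.8, 0.5, -1.9, -3.8, -5.1, -2.6)
)

# Convert yyyymmdd integers to Date so the x-axis shows dates, not row indices
dates <- as.Date(as.character(Stations$time), format = "%Y%m%d")

temp_cols <- c("T2m_GE", "T2m_PU", "T2m_NY")
slopes <- numeric(0)

for (col in temp_cols) {
  temp <- Stations[[col]]
  # One plot per station, titled with its own column name
  plot(dates, temp, type = "l", col = "dodgerblue4",
       xlab = "Date", ylab = "Temperature", main = col)
  # Linear trend against time (in days), drawn on the current plot
  fit <- lm(temp ~ as.numeric(dates))
  abline(fit, col = "red")
  slopes[col] <- coef(fit)[2]
}
```

To see the three plots side by side rather than one after another, par(mfrow = c(1, 3)) can be called before the loop.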

How do I remove NA from a data frame with the intention of using sapply on the data frame [duplicate]

This question already has answers here:
calculate the mean for each column of a matrix in R
(10 answers)
Closed 3 years ago.
I have a data frame:
colA colB
1 15.3 1.76
2 10.8 1.34
3 8.1 1.27
4 19.5 1.47
5 7.2 1.27
6 5.3 1.49
7 9.3 1.31
8 11.1 1.09
9 7.5 1.18
10 12.2 1.22
11 6.7 1.25
12 5.2 1.19
13 19.0 1.95
14 15.1 1.28
15 6.7 1.52
16 8.6 NA
17 4.2 1.12
18 10.3 1.37
19 12.5 1.19
20 16.1 1.05
21 13.3 1.32
22 4.9 1.03
23 8.8 1.12
24 9.5 1.70
How would I be able to remove/change the value of all NAs such that when I use sapply (i.e. sapply(x, mean)), I am taking the mean of 24 rows in the case of colA and 23 rows for colB?
I understand that data frames have to have the same number of rows so using something like na.omit() would not work because it'd remove, in this case, row 16; I'd lose a row of data when I'm calculating the mean for colA.
Thanks!
You should be able to pass na.rm = TRUE and get the mean.
Example:
df <- data.frame(A = 1:3, B = c(NA, 1, 2))
apply(df, 2, mean, na.rm = TRUE)
# A B
# 2.0 1.5
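Since the question asks about sapply specifically, the same na.rm = TRUE argument can be passed through it as well, because sapply forwards extra arguments to the function it applies. A minimal sketch reusing the small df from the example above, with colMeans as a base R alternative; the object names are made up for illustration.

```r
df <- data.frame(A = 1:3, B = c(NA, 1, 2))

# sapply passes na.rm = TRUE on to mean() for every column
means_sapply <- sapply(df, mean, na.rm = TRUE)

# colMeans does the same in one vectorised call
means_col <- colMeans(df, na.rm = TRUE)

# Number of non-missing values each mean is based on
n_used <- colSums(!is.na(df))
```

No rows are dropped from the data frame: each column's mean simply skips its own missing values, so A uses 3 values and B uses 2.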

Aggregate/sum and N/A values

I have a problem with the way aggregate deals with NA values when computing sums.
I would like the sums per area.code from the following table:
test <- read.table(text = "
area.code A B C D
1 0 NA 0.00 NA NA
2 1 0.0 3.10 9.6 0.0
3 1 0.0 3.20 6.0 0.0
4 2 0.0 6.10 5.0 0.0
5 2 0.0 6.50 8.0 0.0
6 2 0.0 6.90 4.0 3.1
7 3 0.0 6.70 3.0 3.2
8 3 0.0 6.80 3.1 6.1
9 3 0.0 0.35 3.2 6.5
10 3 0.0 0.67 6.1 6.9
11 4 0.0 0.25 6.5 6.7
12 5 0.0 0.68 6.9 6.8
13 6 0.0 0.95 6.7 0.0
14 7 1.2 NA 6.8 0.0
")
So, seems pretty easy:
aggregate(.~area.code, test, sum)
area.code A B C D
1 1 0 6.30 15.6 0.0
2 2 0 19.50 17.0 3.1
3 3 0 14.52 15.4 22.7
4 4 0 0.25 6.5 6.7
5 5 0 0.68 6.9 6.8
6 6 0 0.95 6.7 0.0
Apparently not so simple, because area code 7 is completely omitted from the aggregate() command.
I would, however, like the NAs to be completely ignored or treated as zero. Which na-related argument gives that option?
Replacing all NAs with 0 is an option if I just want the sum, but then the mean becomes problematic (since it can no longer differentiate between 0 and NA).
If you are willing to consider an external package (data.table):
library(data.table)
setDT(test)
test[, lapply(.SD, sum), area.code]
area.code A B C D
1: 0 NA 0.00 NA NA
2: 1 0.0 6.30 15.6 0.0
3: 2 0.0 19.50 17.0 3.1
4: 3 0.0 14.52 15.4 22.7
5: 4 0.0 0.25 6.5 6.7
6: 5 0.0 0.68 6.9 6.8
7: 6 0.0 0.95 6.7 0.0
8: 7 1.2 NA 6.8 0.0
One option is to create a function that returns NA when all the values are NA and otherwise uses sum with na.rm = TRUE. Along with that, set the na.action argument in aggregate to na.pass, because by default aggregate removes any row that has at least one NA.
f1 <- function(x) if(all(is.na(x))) NA else sum(x, na.rm = TRUE)
aggregate(.~area.code, test, f1, na.action = na.pass)
# area.code A B C D
#1 0 NA 0.00 NA NA
#2 1 0.0 6.30 15.6 0.0
#3 2 0.0 19.50 17.0 3.1
#4 3 0.0 14.52 15.4 22.7
#5 4 0.0 0.25 6.5 6.7
#6 5 0.0 0.68 6.9 6.8
#7 6 0.0 0.95 6.7 0.0
#8 7 1.2 NA 6.8 0.0
When there are only NA elements and we use sum with na.rm = TRUE, it returns 0
sum(c(NA, NA), na.rm = TRUE)
#[1] 0
Another solution is to use dplyr (note that a group containing only NA values then sums to 0):
library(dplyr)
test %>%
group_by(area.code) %>%
summarise_all(sum, na.rm = TRUE)
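The question also mentions the mean, where replacing NA with 0 would distort the result. A minimal base R sketch, on a made-up subset of the table, of the same na.action = na.pass idea applied to mean; the names test_small and res are hypothetical.

```r
# Hypothetical subset of the table from the question
test_small <- data.frame(
  area.code = c(0, 1, 1, 7),
  A = c(NA, 0, 0, 1.2),
  B = c(0, 3.1, 3.2, NA)
)

# na.action = na.pass keeps rows containing NA;
# na.rm = TRUE is forwarded to mean() so NAs are skipped within each group
res <- aggregate(. ~ area.code, test_small, mean,
                 na.rm = TRUE, na.action = na.pass)
```

Groups whose values are all NA come out as NaN here rather than 0, so they stay distinguishable from genuine zeros.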

how to filter data by condition to make the number of rows to be the same for each group

This is my sample data:
date label type exdate x y z w
1 10 A 2 15 0.25 0.35 13.49
1 10 A 2 12.5 1.30 1.45 13.49
1 10 B 2 10 1.7 1.8 13.49
1 10 B 2 12.5 0.3 0.4 13.49
1 10 B 2 17.5 1.8 0.3 13.49
1 11 A 3 15 0.75 0.8 13.49
1 11 A 3 12.5 1.8 1.9 13.49
1 11 A 3 17.5 0.2 0.35 13.49
1 11 B 3 10 0.1 0.25 13.49
1 11 B 3 15 2.15 2.3 13.49
1 11 B 3 12.5 0.8 0.85 13.49
1 11 B 3 17.5 4.1 4.3 13.49
2 11 A 4 10 3.7 4 13.49
2 11 A 4 15 1 1.1 13.49
2 11 A 4 12.5 2.05 2.2 13.49
2 11 A 4 17.5 0.4 0.55 13.49
2 11 B 4 10 0.3 0.4 13.49
2 11 B 4 15 2.45 2.6 13.49
2 11 B 4 12.5 1.05 1.15 13.49
2 11 B 4 17.5 4.3 4.6 13.49
First, I group my data set by c(date, label, exdate); within each group, the variable 'type' takes the values A and B. But I want the number of rows for type A and type B to be the same within each group.
Filter conditions:
To equalise the number of rows, the distance between x and w should be the same, or almost the same, for each pair of a type A row and a type B row.
For example:
type x w
A 2 3.5
A 3 3.5
A 4 3.5
B 1 3.5
B 2 3.5
# The output after filter
type x w
A 2 3.5 (pair with type_B ; x = 1)
A 3 3.5 (pair with type_B ; x = 2)
B 1 3.5
B 2 3.5
So, for the sample data above, the result I hope:
date label type exdate x y z w
1 10 A 2 15 0.25 0.35 13.49
1 10 A 2 12.5 1.30 1.45 13.49
1 10 B 2 12.5 0.3 0.4 13.49
1 10 B 2 17.5 1.8 0.3 13.49
1 11 A 3 15 0.75 0.8 13.49
1 11 A 3 12.5 1.8 1.9 13.49
1 11 A 3 17.5 0.2 0.35 13.49
1 11 B 3 15 2.15 2.3 13.49
1 11 B 3 12.5 0.8 0.85 13.49
1 11 B 3 17.5 4.1 4.3 13.49
2 11 A 4 10 3.7 4 13.49
2 11 A 4 15 1 1.1 13.49
2 11 A 4 12.5 2.05 2.2 13.49
2 11 A 4 17.5 0.4 0.55 13.49
2 11 B 4 10 0.3 0.4 13.49
2 11 B 4 15 2.45 2.6 13.49
2 11 B 4 12.5 1.05 1.15 13.49
2 11 B 4 17.5 4.3 4.6 13.49
How can I code this to get that result? Should I put an else-if condition inside filter()?

r check and replace stuck data

There are two sensors. The collected data should change over time. How can I identify where the data is stuck and replace it with data from the other sensor?
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
so the data has
d
a b c
1 0.1 0.05
2 0.2 0.20
3 0.3 0.30
4 0.4 0.40
5 0.5 0.40
6 0.6 0.40
7 0.7 0.40
8 0.8 0.40
9 0.9 0.40
10 1.0 0.40
11 1.1 0.40
12 1.2 0.40
13 1.3 0.40
14 1.4 0.40
15 1.5 0.40
16 1.6 0.40
17 1.7 0.40
18 1.8 0.40
19 1.9 0.40
20 2.0 0.40
21 2.1 0.40
22 2.2 2.20
23 2.3 2.30
24 2.4 2.40
Sensor c is stuck at 0.4 from a = 4 to a = 21. Is there a quick way to identify this and replace the stuck part using data from sensor b?
The new column c_updated is what you want. I've created some helpful columns (c_previous and c_is_stuck) which you can remove if you want.
library(dplyr)
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
d %>%
mutate(c_previous = lag(c, default = 0), # get previous measurement for sensor c
c_is_stuck = ifelse(c == c_previous, 1 ,0), # flag stuck for sensor c when current measurement is same as previous one
c_updated = ifelse(c_is_stuck == 1, b, c)) # if sensor c is stuck use measurement from sensor b
# a b c c_previous c_is_stuck c_updated
# 1 1 0.1 0.05 0.00 0 0.05
# 2 2 0.2 0.20 0.05 0 0.20
# 3 3 0.3 0.30 0.20 0 0.30
# 4 4 0.4 0.40 0.30 0 0.40
# 5 5 0.5 0.40 0.40 1 0.50
# 6 6 0.6 0.40 0.40 1 0.60
# 7 7 0.7 0.40 0.40 1 0.70
# 8 8 0.8 0.40 0.40 1 0.80
# 9 9 0.9 0.40 0.40 1 0.90
# 10 10 1.0 0.40 0.40 1 1.00
# 11 11 1.1 0.40 0.40 1 1.10
# 12 12 1.2 0.40 0.40 1 1.20
# 13 13 1.3 0.40 0.40 1 1.30
# 14 14 1.4 0.40 0.40 1 1.40
# 15 15 1.5 0.40 0.40 1 1.50
# 16 16 1.6 0.40 0.40 1 1.60
# 17 17 1.7 0.40 0.40 1 1.70
# 18 18 1.8 0.40 0.40 1 1.80
# 19 19 1.9 0.40 0.40 1 1.90
# 20 20 2.0 0.40 0.40 1 2.00
# 21 21 2.1 0.40 0.40 1 2.10
# 22 22 2.2 2.20 0.40 0 2.20
# 23 23 2.3 2.30 2.20 0 2.30
# 24 24 2.4 2.40 2.30 0 2.40
This is a pretty simple way. Duplicate the c column with an offset of 1 and check whether the two values are identical. If so, take the value from b. The first row has no previous value to compare against, so its replaced value is NA.
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
d$d <- c(NA, d$c[1:23])
d$replaced <- ifelse(d$c == d$d, d$b, d$c)
a b c d replaced
1 1 0.1 0.05 NA NA
2 2 0.2 0.20 0.05 0.2
3 3 0.3 0.30 0.20 0.3
4 4 0.4 0.40 0.30 0.4
5 5 0.5 0.40 0.40 0.5
6 6 0.6 0.40 0.40 0.6
7 7 0.7 0.40 0.40 0.7
8 8 0.8 0.40 0.40 0.8
9 9 0.9 0.40 0.40 0.9
10 10 1.0 0.40 0.40 1.0
11 11 1.1 0.40 0.40 1.1
12 12 1.2 0.40 0.40 1.2
13 13 1.3 0.40 0.40 1.3
14 14 1.4 0.40 0.40 1.4
15 15 1.5 0.40 0.40 1.5
16 16 1.6 0.40 0.40 1.6
17 17 1.7 0.40 0.40 1.7
18 18 1.8 0.40 0.40 1.8
19 19 1.9 0.40 0.40 1.9
20 20 2.0 0.40 0.40 2.0
21 21 2.1 0.40 0.40 2.1
22 22 2.2 2.20 0.40 2.2
23 23 2.3 2.30 2.20 2.3
24 24 2.4 2.40 2.30 2.4
The solution below is as basic as it gets, I think. No additional packages required. Cheers!
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
d$diff.b <- c(NA, diff(d$b))
d$diff.c <- c(NA, diff(d$c))
stuck.index <- which(d$diff.c==0)
d[stuck.index, "c"] <- d[stuck.index, "b"]
# changing to original data frame format
d$diff.b <- NULL
d$diff.c <- NULL
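The answers above flag any reading that equals the previous one, so two legitimately equal consecutive readings would also be replaced. A possible variation, sketched here with base R's rle, only treats a run of identical values as stuck once it reaches a minimum length. The threshold min_run and the names in_long_run, is_first and stuck are assumptions introduced for illustration, not part of the original answers.

```r
a <- 1:24
b <- seq(0.1, 2.4, 0.1)
c <- c(0.05, 0.2, 0.3, rep(0.4, 18), 2.2, 2.3, 2.4)
d <- data.frame(a, b, c)

min_run <- 3  # hypothetical threshold: runs shorter than this are not "stuck"

# Run-length encode sensor c and mark every row belonging to a long run
runs <- rle(d$c)
in_long_run <- rep(runs$lengths >= min_run, runs$lengths)

# The first reading of a run is still a genuine measurement, so keep it
is_first <- c(TRUE, diff(d$c) != 0)
stuck <- in_long_run & !is_first

# Use sensor b wherever sensor c is flagged as stuck
d$c_updated <- ifelse(stuck, d$b, d$c)
```

Like the answers above, this relies on exact equality between consecutive readings, which suits a sensor that repeats the same stored value; noisy data would need a tolerance instead.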

Resources