I am trying to fit a linear regression model for every row (Trial) of two matrices, where each Trial is regressed on all the Trials before it.
I have two similar matrices, each with 30 rows and 41 columns, that both look like this:
Subject1 Subject2 Subject3 Subject4 Subject5 Subject6 Subject7 Subject8 Subject9
Trial 1 NA 66 NA 6 NA 45 NA NA NA
Trial 2 10 105 10 6 6 NA 6 10 15
Trial 3 NA 136 10 6 10 45 15 10 NA
Trial 4 10 NA 10 6 10 45 28 NA 6
Trial 5 10 NA 15 6 15 45 36 0 10
Trial 6 NA 21 NA 6 15 45 55 10 10
Where one matrix has NA, the other one has a value.
I'm trying to loop a regression over every Trial, where the n-th Trial is the response and all the previous Trials (1 to n-1) are the regressors.
After a lot of research I found a way to generate all possible regressions, with
expand.grid(c(TRUE,FALSE), c(TRUE,FALSE), c(TRUE,FALSE), c(TRUE,FALSE))
and built on that. But the last model has 29 times 2 regressors because of the 2 matrices, so the grid would be far too large. Besides, I only need the models that use the previous Trials.
Any help is very much appreciated.
Thanks.
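One way to avoid the grid entirely is to loop over the Trials and build each formula from the Trials that precede it. The following is only a minimal sketch under loud assumptions: it uses a single simulated matrix `m` with no NAs (6 Trials, 9 Subjects), treats the Subjects as observations, and the column names `T1`, `T2`, ... are made up for illustration. How the two real matrices and their NAs should be combined is not settled here.

```r
# Simulated stand-in for one matrix: 6 Trials (rows) x 9 Subjects (columns), no NAs.
set.seed(1)
m <- matrix(rnorm(6 * 9), nrow = 6,
            dimnames = list(paste("Trial", 1:6), paste0("Subject", 1:9)))

# For Trial n, regress it on Trials 1..(n-1); Subjects are the observations.
models <- lapply(2:nrow(m), function(n) {
  d <- as.data.frame(t(m[1:n, , drop = FALSE]))   # Subjects as rows, Trials as columns
  names(d) <- paste0("T", 1:n)                     # hypothetical syntactic names
  f <- reformulate(paste0("T", 1:(n - 1)), response = paste0("T", n))
  lm(f, data = d)
})
names(models) <- rownames(m)[-1]

# Number of estimated coefficients per model (intercept + previous Trials)
sapply(models, function(fit) length(coef(fit)))
```

With real data, a model needs more complete observations than regressors, so the later Trials may not be estimable without further restrictions.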
I have a large column with NAs and I want to rank the times as shown below. The NAs should stay in place but be excluded from the ranking.
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=True)
I want to get the following table:
time Rank
40 3
30 4
50 2
NA NA
60 1
NA NA
20 5
I have used the following code:
df$Rank<--df$time,ties.method="mim")
#fixed data
df<-read.table(text="time
40
30
50
NA
60
NA
20
", header=TRUE)
You can do something like
nonNaIndices <- !is.na(df$time)
df$Rank <- NA
df$Rank[nonNaIndices] <- rank(-df$time[nonNaIndices],ties.method="min")
resulting in
> df
time Rank
1 40 3
2 30 4
3 50 2
4 NA NA
5 60 1
6 NA NA
7 20 5
Note: Please make sure to check your question for missing function calls before submitting it. In your case it could be guessed from the context.
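A shorter base R variant, sketched on the assumption that rank 1 should go to the largest time (as in the desired table): `rank()` has an `na.last` argument, and `na.last = "keep"` leaves the NAs in place with rank NA, so no index vector is needed.

```r
# Same sample data as in the question
df <- data.frame(time = c(40, 30, 50, NA, 60, NA, 20))

# Negate to rank in descending order; na.last = "keep" gives NA ranks to NA times
df$Rank <- rank(-df$time, na.last = "keep", ties.method = "min")
df
```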
You can use dense_rank from dplyr -
library(dplyr)
df %>% mutate(Rank = dense_rank(-time))
# time Rank
#1 40 3
#2 30 4
#3 50 2
#4 NA NA
#5 60 1
#6 NA NA
#7 20 5
I have a data frame with 121 rows and about 10 columns. One of these columns is the Station, another is the depth, and the rest are chemical variables (temperature, salinity, etc.). I want to calculate the depth-integrated value of these chemical properties for each station, using the function oce::integrateTrapezoid. It's my first time writing a loop, so I don't know how to do it. Could you help me?
dA<-matrix(data=NA, nrow=121, ncol=3)
for (Station in unique(datos$Station))
{dA[Station, cd] <- integrateTrapezoid(cd, Profundidad..m., "cA")
}
Station Depth temp
1 10 28
1 50 25
1 100 15
1 150 10
2 9 27
2 45 24
2 98 14
2 152 11
3 11 28.7
3 48 23
3 102 14
3 148 9
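A per-station integration can be sketched without an explicit loop by splitting the data by Station. To keep the example free of package dependencies, the sketch below uses a small hand-written trapezoid helper (`trapezoid` is a hypothetical name, not part of oce); with oce installed, `oce::integrateTrapezoid(d$Depth, d$temp)` should be usable in its place. The column names `Station`, `Depth` and `temp` are taken from the sample table and may differ in the real `datos`.

```r
# Sample data from the table above
datos <- data.frame(
  Station = rep(1:3, each = 4),
  Depth   = c(10, 50, 100, 150, 9, 45, 98, 152, 11, 48, 102, 148),
  temp    = c(28, 25, 15, 10, 27, 24, 14, 11, 28.7, 23, 14, 9)
)

# Trapezoidal rule: sum of interval width times the mean of the two end values
trapezoid <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)       # drop incomplete pairs
  x <- x[ok]; y <- y[ok]
  o <- order(x)                     # integrate in order of increasing depth
  x <- x[o]; y <- y[o]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# One integrated value per station
integrated <- sapply(split(datos, datos$Station),
                     function(d) trapezoid(d$Depth, d$temp))
integrated
```

To cover several variables, the same `sapply()` call can be repeated for each column name, for example inside another `sapply()` over the variable names.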
I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date - and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case - I wouldn't mind getting the earliest observation)
An important note: I know how to create a for loop to solve this problem, but since the dataset is over 4 million observations it isn't practical since it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit of help from dplyr and tidyr
library(dplyr)
library(tidyr)
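# sort so that each client's purchases are numbered from the earliest date
dd %>% arrange(ID, Date) %>% group_by(ID) %>% mutate(seq=row_number()) %>% ungroup() %>%
pivot_wider(id_cols="ID", names_from="seq", values_from = c("Date","Sum"))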
Where dd is your sample data frame above.
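For the "one observation per client within a date range" part of the question, a vectorised base R sketch avoids a for loop: filter to the range, sort by ID and Date, then keep the first row per ID with `duplicated()`. The range 2 to 5 and the shortened sample below are chosen purely for illustration.

```r
# First 12 rows of the sample data
dd <- data.frame(
  ID   = c(1, 1, 1, 2, 3, 4, 2, 4, 6, 3, 7, 3),
  Date = c(1, 2, 3, 4, 5, 6, 1, 3, 2, 5, 6, 7),
  Sum  = c(234, 45, 1, 223, 546, 12, 20, 30, 3, 45, 456, 65)
)

# Keep purchases inside the chosen date range (illustrative bounds)
in_range <- dd[dd$Date >= 2 & dd$Date <= 5, ]

# Sort by client and date, then keep the earliest row for each client
in_range <- in_range[order(in_range$ID, in_range$Date), ]
first_per_id <- in_range[!duplicated(in_range$ID), ]
first_per_id
```

Clients with no purchase in the range simply do not appear in the result.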
I have learned how to impute NA values in R: for a numeric column we normally compute the mean of the data and put it in place of the NAs. But what should I do if, instead of NA, the cell is simply empty?
Please help me.
Let's start with some test data:
person_id <- c("1","2","3","4","5","6","7","8","9","10")
inches <- as.numeric(c("56","58","60","62","64","","68","70","72","74"))
height <- data.frame(person_id,inches)
height
person_id inches
1 1 56
2 2 58
3 3 60
4 4 62
5 5 64
6 6 NA
7 7 68
8 8 70
9 9 72
10 10 74
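The blank was already replaced with NA in height$inches, because as.numeric() cannot convert "" to a number.
If the column were still character (not yet converted), you could also do this yourself:
height$inches[height$inches==""] <- NA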
Now to fill in the NA with the average from the non-missing values of inches.
options(digits=4)
height$inches[is.na(height$inches)] <- mean(height$inches,na.rm=TRUE)
height
person_id inches
1 1 56.00
2 2 58.00
3 3 60.00
4 4 62.00
5 5 64.00
6 6 64.89
7 7 68.00
8 8 70.00
9 9 72.00
10 10 74.00
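If the data come from a file, the empty cells can be turned into NA at read time with the `na.strings` argument of `read.csv()`, which also covers character columns (where a blank would otherwise stay as `""`). A minimal sketch, assuming a small made-up CSV with a hypothetical `city` column:

```r
# Made-up CSV text with one blank city and one blank inches value
csv_text <- "person_id,city,inches
1,Oslo,56
2,,58
3,Bergen,
4,Oslo,62"

# Treat both empty strings and the literal NA as missing
height <- read.csv(text = csv_text, na.strings = c("", "NA"),
                   stringsAsFactors = FALSE)

# Mean imputation for the numeric column, as in the answer above
height$inches[is.na(height$inches)] <- mean(height$inches, na.rm = TRUE)
height
```

The blank `city` is now a proper NA and can be handled separately, since a mean is not defined for character data.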
I need to rank rows based on two variables and I just can't wrap my head around it.
Test data below:
df <- data.frame(A = c(12,35,55,7,6,NA,NA,NA,NA,NA), B = c(NA,12,25,53,12,2,66,45,69,43))
A B
12 NA
35 12
55 25
7 53
6 12
NA 2
NA 66
NA 45
NA 69
NA 43
I want to calculate a third variable, C, that is based on A when A is not NA and on B when A is NA, BUT a row where A is NA should never outrank a row where A is not NA.
In the data above, max(A) should correspond to max(C), and max(B) can only hold the sixth-highest C value, because A has five non-NA values. If A is NA and B would outrank a row where A is not NA, some form of transformation should ensure that the non-NA A row still outranks that row in the final C score.
I would like the result to look something like this:
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 69 6
NA 66 7
NA 45 8
NA 43 9
NA 2 10
So far the closest I can get is
df$C <- ifelse(is.na(df$A), min(df$A, na.rm=T)/df$B, df$A)
But that turns the ranking upside down when A==NA, so B==2 is ranked 6 instead of B==69
A B C
55 25 1
35 12 2
12 NA 3
7 53 4
6 12 5
NA 2 6
NA 43 7
NA 45 8
NA 66 9
NA 69 10
I'm not sure if I could use some kind of weights?
Any suggestions are greatly appreciated! Thanks!
You can try:
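# rank() rather than order(): order() returns sort indices, which only coincide with ranks by chance
df$C <- rank(-df$A, na.last="keep", ties.method="first")
df[is.na(df$A),"C"] <- rank(-df[is.na(df$A),"B"], ties.method="first")+sum(!is.na(df$A))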
and the order for C:
df[order(df$C),]
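The same ranking can also be sketched with a single `order()` call on two keys: first whether A is missing (so that non-missing A rows always come first), then the value of A or B in descending order. The helper name `key` is introduced here only for illustration.

```r
df <- data.frame(A = c(12, 35, 55, 7, 6, NA, NA, NA, NA, NA),
                 B = c(NA, 12, 25, 53, 12, 2, 66, 45, 69, 43))

# Value to sort on: A where available, otherwise B
key <- ifelse(is.na(df$A), df$B, df$A)

# Primary key: rows with A present (FALSE) sort before rows with A missing (TRUE);
# secondary key: larger values first
o <- order(is.na(df$A), -key)

# Convert the sort order into ranks
df$C <- integer(nrow(df))
df$C[o] <- seq_len(nrow(df))
df[order(df$C), ]
```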