In the data frame below, the index column holds a value, and columns t1:t4 hold the number of times that value was recorded at each point in time. For example, index 10 has 1 record at t1, 4 records at t2, and 1 record each at t3 and t4. I would like to return the values in columns t1:t4 based on the index column.
Input:
index t1 t2 t3 t4
10 1 4 1 1
20 2 5 1 0
30 3 6 1 0
40 0 0 0 2
Output:
t1 t2 t3 t4
10 10 10 10
20 10 20 40
20 10 30 40
30 10 NA NA
30 20 NA NA
30 20 NA NA
NA 20 NA NA
NA 20 NA NA
NA 30 NA NA
NA 30 NA NA
NA 30 NA NA
NA 30 NA NA
NA 30 NA NA
NA 30 NA NA
Sample data:
df<-structure(list(index=c (10,20,30,40),
t1 = c(1, 2, 3, 0),
t2 = c(4, 5, 6, 0),
t3 = c(1, 1,1, 0),
t4 = c(1, 0, 0, 2)), row.names = c(NA,4L), class = "data.frame")
df
One dplyr, tidyr and purrr solution could be:
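library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
# For each count column, repeat every row by its count and keep the index values
map(.x = names(df)[-1],
~ df %>%
uncount(get(.x)) %>%
select(!!.x := index) %>%
rowid_to_column()) %>%
reduce(full_join)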
rowid t1 t2 t3 t4
1 1 10 10 10 10
2 2 20 10 20 40
3 3 20 10 30 40
4 4 30 10 NA NA
5 5 30 20 NA NA
6 6 30 20 NA NA
7 7 NA 20 NA NA
8 8 NA 20 NA NA
9 9 NA 20 NA NA
10 10 NA 30 NA NA
11 11 NA 30 NA NA
12 12 NA 30 NA NA
13 13 NA 30 NA NA
14 14 NA 30 NA NA
15 15 NA 30 NA NA
Base R and one line of code.
Map(function(x) rep(df$index, x), df[,-1])
After update:
maxy <- max(apply(df[,-1], 2, sum))
data.frame(Map(function(x) c(rep(df$index, x), rep(NA, maxy - sum(x))), df[,-1]))
Using base R with lapply
lst1 <- lapply(df[-1], function(x) rep(df$index, x))
data.frame(lapply(lst1, `length<-`, max(lengths(lst1))))
Output:
# t1 t2 t3 t4
#1 10 10 10 10
#2 20 10 20 40
#3 20 10 30 40
#4 30 10 NA NA
#5 30 20 NA NA
#6 30 20 NA NA
#7 NA 20 NA NA
#8 NA 20 NA NA
#9 NA 20 NA NA
#10 NA 30 NA NA
#11 NA 30 NA NA
#12 NA 30 NA NA
#13 NA 30 NA NA
#14 NA 30 NA NA
#15 NA 30 NA NA
Here is a base R option
list2DF(
lapply(
df[-1],
function(x) `length<-`(rep(df$index, x), max(colSums(df[-1])))
)
)
which gives
t1 t2 t3 t4
1 10 10 10 10
2 20 10 20 40
3 20 10 30 40
4 30 10 NA NA
5 30 20 NA NA
6 30 20 NA NA
7 NA 20 NA NA
8 NA 20 NA NA
9 NA 20 NA NA
10 NA 30 NA NA
11 NA 30 NA NA
12 NA 30 NA NA
13 NA 30 NA NA
14 NA 30 NA NA
15 NA 30 NA NA
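Both base R answers above rest on two building blocks: rep() with a vector of counts, and `length<-` to pad a vector with NA. A minimal sketch of just those two steps, using the t1 counts from the sample data (the target length of 8 is arbitrary and chosen only for illustration):

```r
# Counts for column t1 from the sample data: index 40 has a count of 0.
index <- c(10, 20, 30, 40)
t1 <- c(1, 2, 3, 0)

# rep() with a vector of counts repeats each value that many times.
expanded <- rep(index, t1)   # 10 20 20 30 30 30

# Growing a vector with `length<-` fills the new positions with NA.
length(expanded) <- 8
expanded                     # 10 20 20 30 30 30 NA NA
```

Padding every column to the same length is what allows the list of vectors to be turned into a data frame.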
Related
I have three single-line dataframes with different numbers and names of columns...
df1:
0 3 6 7 10 14 17
2 18 9 1 14 2 1 1
df2:
0 3 7 9 10 13 14 17 21 35
2 10 4 8 1 5 2 11 2 1 1
df3:
0 3 7 10 12
2 7 3 11 3 1
...and I have a master dataframe.
CREATION CODE
masterdf <- data.frame(matrix(ncol = 50, nrow = 0))
colnames(masterdf) <- c('0',2:50)
0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
I want to take each of the smaller dataframes and put one per row into the master dataframe with the values in the matching columns. When finished, the updated master dataframe will look like this:
0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
1 18 NA 9 NA NA 1 14 NA NA 2 NA NA NA 1 NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 10 NA 4 NA NA NA 8 NA 1 5 NA NA 2 11 NA NA 2 NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA
3 7 NA 3 NA NA NA 11 NA NA 3 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Yes, the column names do need to remain as numbers. As you can see, the number of columns varies with each of the numbered dataframes.
Other notes:
The first column name is 0 and the second column name is 2.
The 0 column will ALWAYS have a value in it in every dataframe.
The row number (2) in each numbered dataframe is superfluous for my purposes.
I've tried nested loops without success.
My use case will end up with thousands of rows in the master dataframe.
Thoughts?
You can simply use the function rbindlist from data.table with fill = TRUE, which matches columns by name and fills the missing ones with NA:
data.table::rbindlist(list(masterdf, df1, df2, df3), fill = TRUE)
Results
0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
1: 18 NA 9 NA NA 1 14 NA NA 2 NA NA NA 1 NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2: 10 NA 4 NA NA NA 8 NA 1 5 NA NA 2 11 NA NA 2 NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3: 7 NA 3 NA NA NA 11 NA NA 3 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
data
masterdf <- data.frame(matrix(ncol = 50, nrow = 0))
colnames(masterdf) <- c('0',2:50)
df1 <- data.frame(t(data.frame("2" = c(18,9,1,14,2,1,1))))
colnames(df1) <- c(0,3,6,7,10,14,17)
df2 <- data.frame(t(data.frame("2" = c(10,4,8,1,5,2,11,2,1,1))))
colnames(df2) <- c(0,3,7,9,10,13,14,17,21,35)
df3 <- data.frame(t(data.frame("2" = c(7,3,11,3,1))))
colnames(df3) <- c(0,3,7,10,12)
Two attempts:
basic for loop, which might be a bit slow with many rows:
df_list <- list(df1,df2,df3)
for(i in seq_along(df_list)) {
masterdf[i, names(df_list[[i]])] <- df_list[[i]]
}
vectorised approach using matrix indexing and a single assignment to all matching rows and columns
df_list <- list(df1,df2,df3)
masterdf[seq_along(df_list),] <- NA
masterdf[cbind(
rep(seq_along(df_list), lengths(df_list)),
match(unlist(lapply(df_list, names)), names(masterdf))
)] <- unlist(df_list)
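The vectorised version relies on indexing with a two-column matrix of (row, column) pairs. A minimal sketch of that mechanism on a small matrix (the matrix size and values here are made up for illustration):

```r
# An empty 2 x 3 matrix to fill.
m <- matrix(NA_real_, nrow = 2, ncol = 3)

# Each row of idx is one (row, column) position in m.
idx <- cbind(c(1, 1, 2), c(1, 3, 2))

# One assignment writes all three cells at once.
m[idx] <- c(18, 9, 5)
m
```

In the answer above, the first column of the index matrix comes from rep() over the list positions and the second from match() on the column names.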
I think you can try the match function. It is a base R function. See the quick example below:
?match
match("2", c("1","2","3"))
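As a rough sketch of how match could be applied to this problem, the values of one small data frame can be placed into a master-length row by looking up their column names. Here the values of df3 are written as a named vector, and master_cols, small and master_row are hypothetical names introduced only for this example:

```r
# Column names of the master data frame: "0", then "2" to "50".
master_cols <- c("0", as.character(2:50))

# The values of df3, keyed by their column names.
small <- c("0" = 7, "3" = 3, "7" = 11, "10" = 3, "12" = 1)

# Start with an all-NA row and fill the positions that match() finds.
master_row <- rep(NA_real_, length(master_cols))
master_row[match(names(small), master_cols)] <- small
master_row[1:12]
```

Repeating this for each small data frame gives one master row per input.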
I am trying to apply a function to a column of a dataframe, but when I do this I get a column full of NA values. I don't understand why.
Here is my code :
courbe <- function(x) exp(coef(regression)[1]*x+coef(regression[2]))
dataT[,c(2)] <- courbe(dataT[,c(1)])
And here is my dataframe:
DateRep Cases
1 25 NA
2 24 NA
3 23 NA
4 22 NA
5 21 NA
6 20 NA
7 19 NA
8 18 NA
9 17 NA
10 16 NA
11 15 NA
12 14 NA
13 13 NA
14 12 NA
15 11 NA
16 10 NA
17 9 NA
18 8 NA
19 7 NA
20 6 NA
21 5 NA
22 4 NA
23 3 NA
24 2 NA
25 1 NA
26 0 NA
The output of print(coef(regression)) :
Coefficients:
(Intercept) dataT$DateRep
2.7095 0.2211
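As figured out in the comments, the mistake was in the placement of the indices: the function uses coef(regression)[1] but coef(regression[2]). In the second call the [2] is applied to the model object itself rather than to the coefficient vector, so it should read coef(regression)[2].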
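A minimal sketch of the corrected pattern on made-up data follows. The toy model, the names fit, b and curve_fn, and the assumption that the slope multiplies x while the intercept is added are all illustrative and not taken from the original regression:

```r
# Toy data that follows y = exp(1 + 0.2 * x) exactly (made up for this sketch).
x <- 1:10
y <- exp(1 + 0.2 * x)
fit <- lm(log(y) ~ x)

# coef() is applied to the whole model; the result is then indexed by name.
b <- coef(fit)
curve_fn <- function(x) exp(b[["(Intercept)"]] + b[["x"]] * x)

predicted <- curve_fn(x)
```

Indexing the coefficients by name rather than by position also avoids mixing up the intercept and the slope.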
I need to create new columns in a data.table based on criteria set relative to some of the existing columns. However, I encountered some problems with missing data. For each person a few data points are missing, but for some individuals the entire data of a questionnaire is missing (see p == 3 or 4 in the example data below). In such cases (entire data of a questionnaire missing) I would like data.table to enter NA in the output for this particular person. I have tried resolving this using if_else from the dplyr package. However, data.table returns NaN or 0 instead of NA as a result, even when all data of a person is missing (i.e. when column p is 3 or 4).
This is my current script, which only partially produces the desired output (i.e. correct output for p== 1 or 2, but not for p== 3 or 4).
library(data.table)
library(dplyr)
# Create example datatable
set.seed(4)
p <- c(rep(1, 5), rep(2, 5), rep(3, 5), rep(4, 5))
time1 <- as.integer(c(sample(1:20, 5, replace=TRUE), sample(21:40, 5, replace=TRUE), rep(NA, 10)))
closeness1 <- as.integer(c(NA, NA, sample(c(1:40,NA), 7, replace=TRUE), NA, rep(NA, 10)))
dt <- data.table::data.table(p, time1, closeness1)
# Compute new columns
dt[, c("mean1", "sum1") := .(
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.numeric(NA), .SD[time1 <= 10, mean(closeness1, na.rm=TRUE)]),
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.integer(NA), .SD[time1 <= 10, sum(closeness1, na.rm=TRUE)])),
by = p, .SDcols = c("time1", "closeness1")]
The following script produces the output I would want to see. However, this is obviously just for illustrative purposes and I would need to know how to modify the above script to produce the desired outcome:
# Select rows from original data that were as intended
p12 <- dplyr::filter(dt, p %in% c(1,2))
# Create new data.table with corrected output
p <- c(rep(3, 5), rep(4, 5))
time1 <- rep(NA_integer_, 10)
closeness1 <- rep(NA_integer_, 10)
mean1 <- rep(NA_integer_, 10)
sum1 <- rep(NA_integer_, 10)
dt.des <- data.table::data.table(p, time1, closeness1, mean1, sum1)
# Desired output
dsrd.opt <- dplyr::bind_rows(p12, dt.des)
dsrd.opt
p time1 closeness1 mean1 sum1
1 1 12 NA 21.5 43
2 1 1 NA 21.5 43
3 1 6 31 21.5 43
4 1 6 12 21.5 43
5 1 17 5 21.5 43
6 2 26 40 NaN 0
7 2 35 18 NaN 0
8 2 39 19 NaN 0
9 2 39 40 NaN 0
10 2 22 NA NaN 0
11 3 NA NA NA NA
12 3 NA NA NA NA
13 3 NA NA NA NA
14 3 NA NA NA NA
15 3 NA NA NA NA
16 4 NA NA NA NA
17 4 NA NA NA NA
18 4 NA NA NA NA
19 4 NA NA NA NA
20 4 NA NA NA NA
Edit:
It looks like I simplified the above example too much. I basically need to compute the mean of closeness1 based on two separate conditions, once for time1 <= 10 and once for time1 > 10 & time1 <= 21. The respective output should then be saved in two new columns. I have updated the example script accordingly, see below:
dt[, c("mean1", "mean2") := .(
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.numeric(NA), .SD[time1 <= 10, mean(closeness1, na.rm=TRUE)]),
dplyr::if_else(sum(is.na(.SD[time1,]))==length(.SD[time1,]) | sum(is.na(.SD[closeness1,]))==length(.SD[closeness1,]),
as.numeric(NA), .SD[time1 > 10 & time1 <= 21, mean(closeness1, na.rm=TRUE)])),
by = p, .SDcols = c("time1", "closeness1")]
Updated example output
dsrd.opt
p time1 closeness1 mean1 mean2
1 1 12 NA 21.5 5
2 1 1 NA 21.5 5
3 1 6 31 21.5 5
4 1 6 12 21.5 5
5 1 17 5 21.5 5
6 2 26 40 NaN NaN
7 2 35 18 NaN NaN
8 2 39 19 NaN NaN
9 2 39 40 NaN NaN
10 2 22 NA NaN NaN
11 3 NA NA NA NA
12 3 NA NA NA NA
13 3 NA NA NA NA
14 3 NA NA NA NA
15 3 NA NA NA NA
16 4 NA NA NA NA
17 4 NA NA NA NA
18 4 NA NA NA NA
19 4 NA NA NA NA
20 4 NA NA NA NA
If I understood you correctly, I would suggest using a simple left join. I think this is pretty straightforward and produces the desired result: groups with no rows satisfying time1 <= 10 have no match in the aggregated table, so they get NA.
dt_result <- merge(x = dt
, y = dt[time1 <= 10, .(mean1 = mean(closeness1, na.rm = TRUE)
, sum1 = sum(closeness1, na.rm = TRUE)), by = list(p)]
, by.x = "p"
, by.y = "p"
, all.x = TRUE
)
> dt_result
p time1 closeness1 mean1 sum1
1: 1 12 NA 21.5 43
2: 1 1 NA 21.5 43
3: 1 6 31 21.5 43
4: 1 6 12 21.5 43
5: 1 17 5 21.5 43
6: 2 26 40 NA NA
7: 2 35 18 NA NA
8: 2 39 19 NA NA
9: 2 39 40 NA NA
10: 2 22 NA NA NA
11: 3 NA NA NA NA
12: 3 NA NA NA NA
13: 3 NA NA NA NA
14: 3 NA NA NA NA
15: 3 NA NA NA NA
16: 4 NA NA NA NA
17: 4 NA NA NA NA
18: 4 NA NA NA NA
19: 4 NA NA NA NA
20: 4 NA NA NA NA
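The same left-join idea can be extended to the two time windows from the question's edit. The following is a rough base R sketch on made-up data, not the asker's data.table; window_mean is a hypothetical helper introduced here, and it assumes each window contains at least one row:

```r
# Small made-up example: group 3 has no data at all.
dt <- data.frame(
  p = rep(1:3, each = 3),
  time1 = c(5, 12, 8, 25, 30, 22, NA, NA, NA),
  closeness1 = c(10, 20, 30, 40, 50, 60, NA, NA, NA)
)

# Mean of closeness1 per p for rows with lo < time1 <= hi.
# Groups without rows in the window are absent from the result.
window_mean <- function(d, lo, hi, name) {
  keep <- !is.na(d$time1) & d$time1 > lo & d$time1 <= hi
  out <- aggregate(closeness1 ~ p, data = d[keep, ], FUN = mean)
  names(out)[2] <- name
  out
}

# Left joins: groups missing from an aggregate get NA in that column.
res <- merge(dt, window_mean(dt, -Inf, 10, "mean1"), by = "p", all.x = TRUE)
res <- merge(res, window_mean(dt, 10, 21, "mean2"), by = "p", all.x = TRUE)
res
```

Because the aggregates are computed separately and joined back, a group with no qualifying rows ends up with NA rather than NaN or 0.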
I have the following dataframe in R
df<-data.frame(
"Val1"=seq(from=1, to=40, by=5), 'Val2'=c(2,4,2,5,11,3,5,3),
"Val3"=seq(from=5, to=40, by=5), "Val4"=c(3,5,7,3,7,5,7,8))
The resulting dataframe looks as follows. Val1 and Val3 are the causal variables, and Val2 and Val4 are the dependent variables.
Val1 Val2 Val3 Val4
1 1 2 5 3
2 6 4 10 5
3 11 2 15 7
4 16 5 20 3
5 21 11 25 7
6 26 3 30 5
7 31 5 35 7
8 36 3 40 8
I wish to obtain the following dataframe as an output
Val1 Val2 Val3 Val4
1 1 2 1 NA
2 2 NA 2 NA
3 3 NA 3 3
4 4 NA 4 NA
5 5 NA 5 NA
6 6 4 6 NA
7 7 NA 7 NA
8 8 NA 8 NA
9 9 NA 9 NA
10 10 NA 10 5
11 11 2 11 NA
12 12 NA 12 NA
13 13 NA 13 NA
14 14 NA 14 NA
15 15 NA 15 7
16 16 5 16 NA
17 17 NA 17 NA
18 18 NA 18 NA
19 19 NA 19 NA
20 20 NA 20 3
21 21 11 21 NA
22 22 NA 22 NA
23 23 NA 23 NA
24 24 NA 24 NA
25 25 NA 25 7
26 26 3 26 NA
27 27 NA 27 NA
28 28 NA 28 NA
29 29 NA 29 NA
30 30 NA 30 5
31 31 5 31 NA
32 32 NA 32 NA
33 33 NA 33 NA
34 34 NA 34 NA
35 35 NA 35 7
36 36 3 36 NA
37 37 NA 37 NA
38 38 NA 38 NA
39 39 NA 39 NA
40 40 NA 40 8
How do I accomplish this? I have written the following code, but it involves creating a second dataframe and then copying data from the first to the second. Is there a way to overwrite the existing dataframe? I would like to avoid loops.
df2<-data.frame('Val1'=
seq(from=min(na.omit(c(df$Val1, df$Val3))), to= max(na.omit(c(df$Val1,
df$Val3))), by=1), "Val3"=seq(from=min(na.omit(c(df$Val1, df$Val3))), to=
max(na.omit(c(df$Val1, df$Val3))), by=1))
###### Create two loops
for(i in df$Val1){
for(j in df2$Val1){
if(i==j){
df2$Val2[df2$Val1==j]=df$Val2[df$Val1==i]
} else{df2$Val2[df2$Val1==j]=NA}}}
for(i in df$Val3){ for(j in df2$Val3){
if(i==j){df2$Val4[df2$Val3==j]=df$Val4[df$Val3==i]
} else{df2$Val4[df2$Val3==j]=NA}}}
Is there a faster, vectorised way to accomplish the same? Any help would be appreciated.
Assuming there's a slight error in your output example (row 3 should show NA for Val4 and the 3 in row 3 should be in row 5), this works:
library(tidyverse)
df_new <- bind_cols(
df %>%
select(Val1, Val2) %>%
complete(., expand(., Val1 = 1:40)),
df %>%
select(Val3, Val4) %>%
complete(., expand(., Val3 = 1:40))
)
> df_new
# A tibble: 40 x 4
Val1 Val2 Val3 Val4
<dbl> <dbl> <dbl> <dbl>
1 1 2 1 NA
2 2 NA 2 NA
3 3 NA 3 NA
4 4 NA 4 NA
5 5 NA 5 3
6 6 4 6 NA
7 7 NA 7 NA
8 8 NA 8 NA
9 9 NA 9 NA
10 10 NA 10 5
# ... with 30 more rows
We use bind_cols() to put together two parts of the dataframe:
First we select the first two columns, expand() the causal variable and complete() the data, then we do it again for the third and fourth column.
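For comparison, a base R sketch of the same expansion without tidyverse follows. It assumes, as the answer above does, that the target range is 1:40; grid is a name introduced only for this example:

```r
df <- data.frame(
  Val1 = seq(from = 1, to = 40, by = 5), Val2 = c(2, 4, 2, 5, 11, 3, 5, 3),
  Val3 = seq(from = 5, to = 40, by = 5), Val4 = c(3, 5, 7, 3, 7, 5, 7, 8)
)

# Full range of the causal variables.
grid <- 1:40

# match() returns NA where a grid value is absent from the original column,
# and indexing with NA yields NA, so no loop is needed.
df_new <- data.frame(
  Val1 = grid,
  Val2 = df$Val2[match(grid, df$Val1)],
  Val3 = grid,
  Val4 = df$Val4[match(grid, df$Val3)]
)
head(df_new, 10)
```

Assigning the result back to df overwrites the existing data frame if a second object is not wanted.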
I have an example dataset
library(dplyr)
newdata<-data.frame(Tow.y=c(21,"NA","NA","NA","NA","NA",22,"NA","NA","NA","NA","NA",23,"NA","NA"),Tow=c("NA","NA","NA",21,"NA","NA","NA","NA",22,"NA","NA","NA","NA","NA",23))
newdata$Tow.y<-as.numeric(as.character(newdata$Tow.y))
newdata$Tow<-as.numeric(as.character(newdata$Tow))
newdata1<-newdata %>%
mutate(Station = coalesce(Tow.y, Tow))
newdata1
This code produces:
Tow.y Tow Station
1 21 NA 21
2 NA NA NA
3 NA NA NA
4 NA 21 21
5 NA NA NA
6 NA NA NA
7 22 NA 22
8 NA NA NA
9 NA 22 22
10 NA NA NA
11 NA NA NA
12 NA NA NA
13 23 NA 23
14 NA NA NA
15 NA 23 23
I would like to fill in the NAs that lie between two occurrences of the same value in Station. So the NAs in between the two 21 values would become 21, the NAs in between the 22s would become 22, etc. The NAs in between two different numbers (e.g. between 21 and 22) would remain NA.
Like this:
Tow.y Tow Station
1 21 NA 21
2 NA NA 21
3 NA NA 21
4 NA 21 21
5 NA NA NA
6 NA NA NA
7 22 NA 22
8 NA NA 22
9 NA 22 22
10 NA NA NA
11 NA NA NA
12 NA NA NA
13 23 NA 23
14 NA NA 23
15 NA 23 23
I have tried the na.locf function in the zoo package, but that replaces all NA values.
newdata1$Station2 <- zoo::na.locf(newdata1$Station, na.rm = FALSE)
Other examples I have looked at show that you can use na.locf with a grouping variable, but I don't have a grouping variable that is complete for the data set. Does anyone have a method for filling in the NAs only where I need them to be filled in?
Here's a good way. I left the helper columns in to demonstrate how it works, but you can easily remove them with a select.
newdata1 %>%
mutate(from_first = zoo::na.locf(Station, na.rm = FALSE),
from_last = zoo::na.locf(Station, na.rm = FALSE, fromLast = TRUE),
result = if_else(from_first == from_last, from_first, Station))
# Tow.y Tow Station from_first from_last result
# 1 21 NA 21 21 21 21
# 2 NA NA NA 21 21 21
# 3 NA NA NA 21 21 21
# 4 NA 21 21 21 21 21
# 5 NA NA NA 21 22 NA
# 6 NA NA NA 21 22 NA
# 7 22 NA 22 22 22 22
# 8 NA NA NA 22 22 22
# 9 NA 22 22 22 22 22
# 10 NA NA NA 22 23 NA
# 11 NA NA NA 22 23 NA
# 12 NA NA NA 22 23 NA
# 13 23 NA 23 23 23 23
# 14 NA NA NA 23 23 23
# 15 NA 23 23 23 23 23
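The same two-pass idea can be sketched in base R without zoo. The helper locf below is a hypothetical stand-in for na.locf(x, na.rm = FALSE), written only for this illustration:

```r
# Carry the last non-NA value forward; leading NAs stay NA.
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # slot 1 holds NA for leading gaps
}

station <- c(21, NA, NA, 21, NA, NA, 22, NA, 22, NA, NA, NA, 23, NA, 23)

fwd <- locf(station)              # fill from the top
bwd <- rev(locf(rev(station)))    # fill from the bottom

# Keep the filled value only where both directions agree.
result <- ifelse(!is.na(fwd) & !is.na(bwd) & fwd == bwd, fwd, station)
result
```

Like the answer above, this assumes a station value does not reappear later after a different station; if it could, the gap between the two appearances would also be filled.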
Based on the example, it seems that the 'Tow' and 'Tow.y' values match in a 'start', 'end' way. In that case, we can use base R methods.
Create a sequence index ('i1') to replicate the non-NA elements in 'Tow' (or 'Tow.y') for the 'Station' column. The 'lst' returns a list of numeric index, which is used to assign the values to 'Station'
lst <- do.call(Map, c(f = seq, unname(lapply(newdata,
function(x) seq_along(x)[!is.na(x)]))))
i1 <- unlist(lst)
newdata$Station[i1] <- rep(na.omit(newdata$Tow), lengths(lst))
newdata
# Tow.y Tow Station
#1 21 NA 21
#2 NA NA 21
#3 NA NA 21
#4 NA 21 21
#5 NA NA NA
#6 NA NA NA
#7 22 NA 22
#8 NA NA 22
#9 NA 22 22
#10 NA NA NA
#11 NA NA NA
#12 NA NA NA
#13 23 NA 23
#14 NA NA 23
#15 NA 23 23
Or using the same logic with tidyverse
library(tidyverse)
newdata %>%
mutate_all(funs(row_number() * !is.na(.))) %>%
map( ~ .x[.x!=0]) %>%
transpose %>%
map(reduce, `:`) %>%
set_names(na.omit(newdata$Tow)) %>%
stack %>%
right_join(newdata %>% mutate(values = row_number())) %>%
rename(Station = ind) %>%
ungroup %>%
select(names(newdata), everything(), -values)
# Tow.y Tow Station
#1 21 NA 21
#2 NA NA 21
#3 NA NA 21
#4 NA 21 21
#5 NA NA <NA>
#6 NA NA <NA>
#7 22 NA 22
#8 NA NA 22
#9 NA 22 22
#10 NA NA <NA>
#11 NA NA <NA>
#12 NA NA <NA>
#13 23 NA 23
#14 NA NA 23
#15 NA 23 23