I have a file with 3 columns. The 1st column is an ID; the 2nd and 3rd are values for 2 conditions. In the condition columns I have both - and + values. I would like to make 2 separate files: the 1st one for the negative values and the 2nd one for the positive values. Do you know how to do that in R?
Something like this?
set.seed(1)
df1 <- data.frame(id=1:5,cond1 = sample(-100:100,5), cond2 = sample(-100:100,5))
df_neg <- df_pos <- df1
df_pos[,2:3][df1[,2:3]<0] <- NA # or 0 (note: NULL is not valid in this kind of assignment)
df_neg[,2:3][df1[,2:3]>0] <- NA # or 0
# > df1
# id cond1 cond2
# 1 1 -47 80
# 2 2 -26 88
# 3 3 13 31
# 4 4 79 24
# 5 5 -61 -88
# > df_pos
# id cond1 cond2
# 1 1 NA 80
# 2 2 NA 88
# 3 3 13 31
# 4 4 79 24
# 5 5 NA NA
# > df_neg
# id cond1 cond2
# 1 1 -47 NA
# 2 2 -26 NA
# 3 3 NA NA
# 4 4 NA NA
# 5 5 -61 -88
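If you actually need two separate files on disk rather than two data frames, a minimal sketch (the CSV format and file names here are just assumptions):
write.csv(df_neg, "negative_values.csv", row.names = FALSE) # placeholder file name
write.csv(df_pos, "positive_values.csv", row.names = FALSE)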
I'm trying to reshape my dataset from long to wide format, but while this is one of the most discussed topics, I couldn't find a solution for my case, nor could I generalize from the methods others have used.
My data is in long format, where each ID has a different number of rows (relative to other IDs). I want to transform to wide format, where each ID has one row and the data is represented by columns whose suffix reflects the order in which each value appears per ID.
To illustrate:
Notice that the NA values don't necessarily correspond between the two formats. In the long format, NAs are simply missing from the data; in the wide format, NAs appear wherever an ID has fewer values than other IDs have for variable x.
My Data
In real life, my data has more than one variable, and it could come in one of two versions:
Version 1 :: For each ID, values appear at the same row across variables
## reproducible data
set.seed(125)
runs_per_id <- sample(5:9, 4, replace = TRUE)
id <- rep(1:4, times = runs_per_id)
set.seed(300)
is_value <- sample(c(0, 1), size = length(id), replace = TRUE)
x <- is_value
x[which(as.logical(is_value))] <- sample(1:100, size = sum(x))
y <- is_value
y[which(as.logical(is_value))] <- sample(1:100, size = sum(y))
z <- is_value
z[which(as.logical(is_value))] <- sample(1:100, size = sum(z))
d <- as.data.frame(cbind(id, x, y, z))
d[d == 0] <- NA
d
# id x y z
# 1 1 38 63 61
# 2 1 17 27 76
# 3 1 32 81 89
# 4 1 NA NA NA
# 5 1 75 2 53
# 6 1 NA NA NA
# 7 2 NA NA NA
# 8 2 40 75 4
# 9 2 NA NA NA
# 10 2 NA NA NA
# 11 2 28 47 70
# 12 2 NA NA NA
# 13 2 71 67 33
# 14 3 NA NA NA
# 15 3 95 26 82
# 16 3 NA NA NA
# 17 3 41 7 99
# 18 3 97 8 68
# 19 4 NA NA NA
# 20 4 NA NA NA
# 21 4 93 38 58
# 22 4 NA NA NA
# 23 4 NA NA NA
Version 2 :: For each ID, values don't necessarily appear at the same row across variables
## reproducible data based on generating d from above
set.seed(12)
d2 <- data.frame(replicate(3, sample(0:1,length(id),rep=TRUE)))
d2[d2 != 0] <- sample(1:100, size = sum(d2 != 0))
d2[d2 == 0] <- NA
colnames(d2) <- c("x", "y", "z")
d2 <- as.data.frame(cbind(id, d2))
d2
## id x y z
## 1 1 18 28 5
## 2 1 85 93 22
## 3 1 55 59 NA
## 4 1 NA NA 67
## 5 1 NA 15 77
## 6 1 58 NA NA
## 7 2 NA 7 NA
## 8 2 NA NA 91
## 9 2 88 14 NA
## 10 2 13 NA NA
## 11 2 32 NA NA
## 12 2 NA 80 71
## 13 2 40 74 69
## 14 3 NA NA NA
## 15 3 96 NA 76
## 16 3 NA NA NA
## 17 3 73 66 NA
## 18 3 52 NA NA
## 19 4 56 12 16
## 20 4 53 NA NA
## 21 4 NA 42 84
## 22 4 39 99 NA
## 23 4 NA 37 NA
The Output I'm looking for
Version 1's data
Version 2's data
Trying to figure this out
I've used tidyr::spread() and even the new experimental pivot_wider() (inspired by this solution), but couldn't get it to number the occurrences of values along the variable so that the numbering is represented in the column names.
Ideally, a single solution would address both data versions I presented. It basically just needs to be agnostic to the number of values each id has in each column, and let the data dictate... I think it's a simple problem, but I just can't wrap my head around it.
Thanks!!!
The following is a solution based on @A.Suliman's comment.
library(tidyr)
library(dplyr)
d %>%
  # Combine all values besides id in one column
  gather(key, value, -id) %>%
  # Filter rows without a value
  filter(!is.na(value)) %>%
  group_by(id, key) %>%
  # Create a new key variable numbering the key for each id
  mutate(key_new = paste0(key, seq_len(n()))) %>%
  ungroup() %>%
  select(-key) %>%
  # Spread the data with the new key
  spread(key_new, value)
# A tibble: 4 x 13
# id x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 38 17 32 75 63 27 81 2 61 76 89 53
# 2 2 40 28 71 NA 75 47 67 NA 4 70 33 NA
# 3 3 95 41 97 NA 26 7 8 NA 82 99 68 NA
# 4 4 93 NA NA NA 38 NA NA NA 58 NA NA NA
For d2 instead of d it gives:
# A tibble: 4 x 13
# id x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 18 85 55 58 28 93 59 15 5 22 67 77
# 2 2 88 13 32 40 7 14 80 74 91 71 69 NA
# 3 3 96 73 52 NA 66 NA NA NA 76 NA NA NA
# 4 4 56 53 39 NA 12 42 99 37 16 84 NA NA
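Since gather() and spread() are now superseded, here is a sketch of the same pipeline using tidyr's pivot functions (assuming tidyr >= 1.0.0; the logic is identical):
d %>%
  # Combine all values besides id in one column, dropping NAs directly
  pivot_longer(-id, names_to = "key", values_drop_na = TRUE) %>%
  group_by(id, key) %>%
  # Number the occurrences of each key within id
  mutate(key_new = paste0(key, row_number())) %>%
  ungroup() %>%
  select(-key) %>%
  # Spread the data with the numbered key
  pivot_wider(names_from = key_new, values_from = value)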
I've got the following data frame df:
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "06/01/1951", "07/01/1951", "08/01/1951", "09/01/1951", "10/01/1951", "11/01/1951", "12/01/1951", "13/01/1951", "14/01/1951", "15/01/1951", "16/01/1951", "17/01/1951", "18/01/1951", "19/01/1951", "20/01/1951", "21/01/1951", "22/01/1951", "23/01/1951")
member <- c(1,NA,NA,3,NA,NA,NA,NA,NA,1,1,NA,2,NA,NA,NA,NA,NA,1,NA,NA,NA,NA)
df <- data.frame(time, member)
df$time = as.Date(df$time,format="%d/%m/%Y")
I'd like a day with an NA value for "member" that comes right before a day where member is 1 to become a 0. UNLESS there is a 1 on the day before a 1 (two consecutive 1s): I wouldn't want that 1 to become a 0, only the NA values before a 1.
The desired data frame would be:
df
time member
1 01/01/1951 1
2 02/01/1951 NA
3 03/01/1951 NA
4 04/01/1951 3
5 05/01/1951 NA
6 06/01/1951 NA
7 07/01/1951 NA
8 08/01/1951 NA
9 09/01/1951 0
10 10/01/1951 1
11 11/01/1951 1
12 12/01/1951 NA
13 13/01/1951 2
14 14/01/1951 NA
15 15/01/1951 NA
16 16/01/1951 NA
17 17/01/1951 NA
18 18/01/1951 0
19 19/01/1951 1
20 20/01/1951 NA
21 21/01/1951 NA
22 22/01/1951 NA
23 23/01/1951 NA
Ideas?
So we need to check if df$member is NA and the next value is 1; when both of those are true, we set df$member to 0. Since comparing NA values with == itself yields NA, we wrap the condition in which(), which drops those NAs (NAs are not allowed in subscripted assignments):
df$member[which(is.na(df$member) & c(df$member[-1] == 1, FALSE))] <- 0
df
# time member
# 1 1951-01-01 1
# 2 1951-01-02 NA
# 3 1951-01-03 NA
# 4 1951-01-04 3
# 5 1951-01-05 NA
# 6 1951-01-06 NA
# 7 1951-01-07 NA
# 8 1951-01-08 NA
# 9 1951-01-09 0
# 10 1951-01-10 1
# 11 1951-01-11 1
# 12 1951-01-12 NA
# 13 1951-01-13 2
# 14 1951-01-14 NA
# 15 1951-01-15 NA
# 16 1951-01-16 NA
# 17 1951-01-17 NA
# 18 1951-01-18 0
# 19 1951-01-19 1
# 20 1951-01-20 NA
# 21 1951-01-21 NA
# 22 1951-01-22 NA
# 23 1951-01-23 NA
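An equivalent sketch using dplyr::lead(), which some may find easier to read (assumes the dplyr package is available):
library(dplyr)
df <- df %>%
  # %in% returns FALSE rather than NA when comparing NAs, so the condition is always TRUE/FALSE
  mutate(member = ifelse(is.na(member) & lead(member) %in% 1, 0, member))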
So let's take the following data:
set.seed(123)
A <- 1:10
age<- sample(20:50,10)
height <- sample(100:210,10)
df1 <- data.frame(A, age, height)
B <- c(1,1,1,2,2,3,3,5,5,5,5,8,8,9,10,10)
injury <- sample(letters[1:5],16, replace=T)
df2 <- data.frame(B, injury)
Now, we can merge the data using the following code:
df3 <- merge(df1, df2, by.x = "A", by.y = "B", all=T)
head(df3)
# A age height injury
# 1 1 28 206 e
# 2 1 28 206 d
# 3 1 28 206 d
# 4 2 43 149 e
# 5 2 43 149 d
# 6 3 31 173 d
But what I want in the new data frame is the injuries spread out as separate columns, one per occurrence.
So the desired output should look like this:
So in this simple example we know that the maximum number of injuries per unique df2$B is 4, so we need 4 new columns.
But my data has an unknown number, so code is needed to work out the correct number of columns; something like
length(unique(df2$injury[df2$B]))
but that is not correct syntax either; the output here should equal 4.
I don't know where the letters are coming from in your sample output, because there are none in the variables in your sample input, but you can try something like:
library(splitstackshape)
dcast.data.table(getanID(df3, c("A", "age")), A + age + height ~ .id,
                 value.var = "injury")
## A age height 1 2 3 4
## 1: 1 28 206 4 3 3 NA
## 2: 2 43 149 4 3 NA NA
## 3: 3 31 173 3 3 NA NA
## 4: 4 44 161 NA NA NA NA
## 5: 5 45 111 3 2 1 4
## 6: 6 21 195 NA NA NA NA
## 7: 7 33 125 NA NA NA NA
## 8: 8 41 104 4 3 NA NA
## 9: 9 32 133 4 NA NA NA
## 10: 10 30 197 1 2 NA NA
This adds a secondary ID based on the first two columns and then spreads it to a wide format.
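To see the intermediate step, you can inspect the counter that getanID() appends before any reshaping happens:
head(getanID(df3, c("A", "age"))) # the .id column numbers the duplicates within each A/age pair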
If you want to accomplish this using the tidyr package, I found it necessary to create an index variable:
library(dplyr)
library(tidyr)
df3 %>%
  group_by(A) %>%
  mutate(ind = row_number()) %>%
  spread(ind, injury)
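As an aside, the count asked about above (the maximum number of injuries per unique df2$B, which should equal 4 here) can be computed with something like:
max(table(df2$B)) # 4 in this example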
My data is structured as follows:
library(data.table)
DT <- data.table(Id=c(1,2,3,4,5), Va1=c(3,13,NA,NA,NA), Va2=c(4,40,NA,NA,4), Va3=c(5,34,NA,7,84),
                 Va4=c(2,23,NA,63,9), Vb1=c(8,45,1,7,0), Vb2=c(0,35,0,7,6), Vb3=c(63,0,0,0,5), Vc1=c(2,5,0,0,4))
> DT
Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1
1: 1 3 4 5 2 8 0 63 2
2: 2 13 40 34 23 45 35 0 5
3: 3 NA NA NA NA 1 0 0 0
4: 4 NA NA 7 63 7 7 0 0
5: 5 NA 4 84 9 0 6 5 4
Additionally, I have a reference list that references all the column groups:
reference <- list(g.1=c(2,3,4,5), g.2=c(6,7,8), g.3=c(9))
Columns 2,3,4,5 (variables Va1, Va2, Va3, and Va4) belong to one group of variables. Columns 6,7,8 (variables Vb1, Vb2, Vb3) belong to a second group. Column 9 (variable Vc1) belongs to a third group.
What I need to do is calculate the difference between consecutive columns within column groups.
I.e. I need to find the difference between Va2 and Va1, and between Va3 and Va2, etc... but not between Vb1 and Va4.
The output should look like:
Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1 D[Va1:Va2] D[Va2:Va3] D[Va3:Va4] D[Vb1:Vb2] D[Vb2:Vb3]
1: 1 3 4 5 2 8 0 63 2 1 1 -3 -8 63
2: 2 13 40 34 23 45 35 0 5 27 -6 -11 -10 -35
3: 3 NA NA NA NA 1 0 0 0 NA NA NA -1 0
4: 4 NA NA 7 63 7 7 0 0 NA NA 56 0 -7
5: 5 NA 4 84 9 0 6 5 4 NA 80 -75 6 -1
Currently I am using the following loop:
for(i in 1:(length(reference)-1)){
  tmp <- NULL
  tmp <- as.list(reference[[i]])
  tmp <- tmp[-length(tmp)]
  tmp <- mapply(c, lapply(tmp, FUN = function(x) x+1), tmp, SIMPLIFY=FALSE)
  for(j in 1:length(tmp)){
    data <- cbind(data, delta = data[, tmp[[j]][1], with = F] - data[, tmp[[j]][2], with = F])
  }
}
but my real data.table has 300-500 columns and over 1,000,000 rows.
How can I make this more efficient?
I think your loop is fine, except you should use := instead of cbind to add columns:
ref <- lapply(reference, function(x) names(DT)[x])
for (g in ref){
  if (length(g)==1) next          # single-column groups have no consecutive pairs
  gx = tail(g,-1)                 # later column of each consecutive pair
  gy = head(g,-1)                 # earlier column of each consecutive pair
  gn = paste0("D[",gy,":",gx,"]") # names for the new difference columns
  DT[, (gn) := mapply(function(x,y) .SD[[x]]-.SD[[y]], gx, gy, SIMPLIFY=FALSE)]
}
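As a quick spot check once the loop has run, the first derived column should reproduce the D[Va1:Va2] column from the desired output above:
DT[, .(Id, `D[Va1:Va2]`)] # 1, 27, NA, NA, NA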
I am trying to shorten a chunk of code to make it faster and easier to modify. This is a short example of my data.
order obs year var1 var2 var3
1 3 1 1 32 588 NA
2 4 1 2 33 689 2385
3 5 1 3 NA 678 2369
4 33 3 1 10 214 1274
5 34 3 2 10 237 1345
6 35 3 3 10 242 1393
7 78 6 1 5 62 NA
8 79 6 2 5 75 296
9 80 6 3 5 76 500
10 93 7 1 NA NA NA
11 94 7 2 4 86 247
12 95 7 3 3 54 207
Basically, what I want is for R to find every possible unique combination of two values (observations) in column "obs" within the same year, to create a new matrix or data frame whose observations are the aggregation of the originals. Order is not important, so 1+6 = 6+1. For instance, with 150 observations I expect 11,175 feasible combinations (each year).
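That count is easy to verify:
choose(150, 2) # 11175 unique unordered pairs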
I sort of got what I want with basic coding but, as you will see, it is way too long (I have built 66 different new data sets this way, so it does not really make sense), and I am wondering how to shorten it. I did some trials (plyr, ...) with no real success. Here is what I did:
# For the 1st year, groups of 2 obs
newmatrix <- data.frame(t(combn(unique(data$obs[data$year==1]), 2)))
colnames(newmatrix) <- c("obs1", "obs2")
newmatrix$name <- do.call(paste, c(newmatrix[c("obs1", "obs2")], sep = "_"))
# and the aggregation of var. using indexes, which I will skip here to save your time :)
To illustrate, here is the result for the 1st year, given the above sample. NAs appear because I only computed pairs where both values were valid, and only for variables 1 and 3. Also, I used the sum, but it could be any other function:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 42 NA
2 2 1 6 1_6 37 NA
3 3 1 7 1_7 NA NA
4 4 3 6 3_6 15 NA
5 5 3 7 3_7 NA NA
6 6 6 7 6_7 NA NA
As for the first 2 lines of the 3rd year, the same type of matrix:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 NA 3762
2 2 1 6 1_6 NA 2868
.......... etc ............
I hope I explained myself. Thank you in advance for your hints on how to do this more efficiently.
I would use split-apply-combine to split by year, find all the combinations, and then combine back together:
do.call(rbind, lapply(split(data, data$year), function(x) {
  p <- combn(nrow(x), 2)
  data.frame(order=paste(x$order[p[1,]], x$order[p[2,]], sep="_"),
             obs1=x$obs[p[1,]],
             obs2=x$obs[p[2,]],
             year=x$year[1],
             var1=x$var1[p[1,]] + x$var1[p[2,]],
             var2=x$var2[p[1,]] + x$var2[p[2,]],
             var3=x$var3[p[1,]] + x$var3[p[2,]])
}))
# order obs1 obs2 year var1 var2 var3
# 1.1 3_33 1 3 1 42 802 NA
# 1.2 3_78 1 6 1 37 650 NA
# 1.3 3_93 1 7 1 NA NA NA
# 1.4 33_78 3 6 1 15 276 NA
# 1.5 33_93 3 7 1 NA NA NA
# 1.6 78_93 6 7 1 NA NA NA
# 2.1 4_34 1 3 2 43 926 3730
# 2.2 4_79 1 6 2 38 764 2681
# 2.3 4_94 1 7 2 37 775 2632
# 2.4 34_79 3 6 2 15 312 1641
# 2.5 34_94 3 7 2 14 323 1592
# 2.6 79_94 6 7 2 9 161 543
# 3.1 5_35 1 3 3 NA 920 3762
# 3.2 5_80 1 6 3 NA 754 2869
# 3.3 5_95 1 7 3 NA 732 2576
# 3.4 35_80 3 6 3 15 318 1893
# 3.5 35_95 3 7 3 13 296 1600
# 3.6 80_95 6 7 3 8 130 707
This enables you to be very flexible in how you combine data pairs of observations within a year: x[p[1,],] represents the year-specific data for the first element in each pair, and x[p[2,],] represents the year-specific data for the second element in each pair. You can return a year-specific data frame with any combination of data for the pairs, and the year-specific data frames are combined into a single final data frame with do.call and rbind.
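If you ever need a different aggregation than the sum, the combiner can be factored out as a function applied to the paired rows. A sketch (the combine_pairs name and the hard-coded var columns are my own, not from the original answer):
combine_pairs <- function(data, f = `+`) {
  do.call(rbind, lapply(split(data, data$year), function(x) {
    p <- combn(nrow(x), 2) # all unordered pairs of rows within the year
    cbind(data.frame(order = paste(x$order[p[1, ]], x$order[p[2, ]], sep = "_"),
                     obs1 = x$obs[p[1, ]], obs2 = x$obs[p[2, ]], year = x$year[1]),
          f(x[p[1, ], c("var1", "var2", "var3")],
            x[p[2, ], c("var1", "var2", "var3")]))
  }))
}
combine_pairs(data)                             # pairwise sums, as above
combine_pairs(data, function(a, b) (a + b) / 2) # pairwise means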