Spread row values into columns
My data looks like this:
> head(df2)
ID fungi Conc Abs date_no
1 1 R3 2.500000 0.209 0
22 1 R3 1.250000 0.153 0
43 1 R3 0.625000 0.159 0
64 1 R3 0.312500 0.164 0
85 1 R3 0.156250 0.157 0
106 1 R3 0.078125 0.170 0
I used spread(), which split the date column into three columns but didn't populate them correctly.
separate_DF <- spread(df2, "date_no", "Abs")
What I get is this...
> head(df3)
ID fungi Conc date_no_0 date_no_1 date_no_3
1 1 R3 0.01953125 0.162 NA NA
2 1 R3 0.03906253 0.169 NA NA
3 1 R3 0.07812500 0.170 NA NA
4 1 R3 0.15625000 0.157 NA NA
5 1 R3 0.31250000 0.164 NA NA
6 1 R3 0.62500000 0.159 NA NA
What I want is for the three date columns to be populated with the Abs values, with each fungi at each concentration in its own row.
Try this one,
library(tidyr)
txt <- "fungi date Abs Conc
1 1 x 2.5
1 2 x 2.5
1 3 x 2.5
2 1 x 2.5
2 2 x 2.5
2 3 x 2.5
"
date_df <- read.table(textConnection(txt), header = TRUE)
print(spread(date_df, date, Abs, sep=""))
Result:
fungi Conc date1 date2 date3
1 1 2.5 x x x
2 2 2.5 x x x
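In current tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same reshape, assuming tidyr 1.0 or later is installed; the Abs values below are made up for illustration and are not the asker's data:

```r
library(tidyr)

# Toy long-format data: one row per fungi/date pair (Abs values are illustrative only)
date_df <- data.frame(
  fungi = rep(1:2, each = 3),
  date  = rep(1:3, times = 2),
  Abs   = c(0.21, 0.15, 0.16, 0.17, 0.18, 0.19),
  Conc  = 2.5
)

# One column per date, filled with Abs; fungi and Conc identify each output row
wide <- pivot_wider(date_df,
                    names_from   = date,
                    values_from  = Abs,
                    names_prefix = "date")
print(wide)
```

Every column not named in names_from or values_from is treated as an identifier, so rows are only merged when all of those columns match exactly.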
I have a data frame with three columns, Distance, Age, and Value, where there are three repeated Value entries for every Distance and Age combination. I would like to generate a random number for Value for certain Distance and Age combinations. I can generate a random number, but the same number is repeated, and I need three different random numbers.
Example Data
set.seed(321)
dat <- data.frame(matrix(ncol = 3, nrow = 27))
colnames(dat)[1:3] <- c("Distance", "Age", "Value")
dat$Distance <- rep(c(0.5,1.5,2.5), each = 9)
dat$Age <- rep(1:3, times = 9)
The code below creates a random number for the Distance and Age combo, but the random number is the same for each of the three measurements; they should be different random numbers.
dat$Value <- ifelse(dat$Distance == '0.5' & dat$Age == '1',
rep(rnorm(3,10,2),3), NA)
Instead of getting the same repeated random number for the Distance and Age combo:
head(dat)
Distance Age Value
1 0.5 1 13.40981
2 0.5 2 NA
3 0.5 3 NA
4 0.5 1 13.40981
5 0.5 2 NA
6 0.5 3 NA
I would like different random numbers for the Distance and Age combo:
head(dat)
Distance Age Value
1 0.5 1 13.40981
2 0.5 2 NA
3 0.5 3 NA
4 0.5 1 11.18246
5 0.5 2 NA
6 0.5 3 NA
The numbers for Value don't really matter and are for demonstration purposes only.
Replace rep(rnorm(3,10,2),3) with rnorm(nrow(dat), 10, 2).
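The repetition comes from vector recycling: rep(rnorm(3,10,2),3) is a length-9 vector of the form a, b, c, a, b, c, ..., and ifelse() takes element i of it for row i. The matching rows are 1, 4 and 7, which all land on a. A minimal sketch of the effect (the names recycled, fresh and hit are only for illustration):

```r
set.seed(321)

# Three draws repeated three times: a, b, c, a, b, c, a, b, c
recycled <- rep(rnorm(3, 10, 2), 3)

# Rows where Distance == 0.5 and Age == 1 in the question's data
hit <- c(1, 4, 7)

# All three positions point at the same draw
print(recycled[hit])

# Drawing one value per row gives a different number at each position
fresh <- rnorm(9, 10, 2)
print(fresh[hit])
```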
Something like this?
library(dplyr)
dat %>%
mutate(Value = ifelse(Distance == 0.5 & Age == 1, sample(1000,nrow(dat), replace = TRUE), NA))
Distance Age Value
1 0.5 1 478
2 0.5 2 NA
3 0.5 3 NA
4 0.5 1 707
5 0.5 2 NA
6 0.5 3 NA
7 0.5 1 653
8 0.5 2 NA
9 0.5 3 NA
10 1.5 1 NA
11 1.5 2 NA
12 1.5 3 NA
13 1.5 1 NA
14 1.5 2 NA
15 1.5 3 NA
16 1.5 1 NA
17 1.5 2 NA
18 1.5 3 NA
19 2.5 1 NA
20 2.5 2 NA
21 2.5 3 NA
22 2.5 1 NA
23 2.5 2 NA
24 2.5 3 NA
25 2.5 1 NA
26 2.5 2 NA
27 2.5 3 NA
You can eliminate the ifelse():
# compare against numbers rather than strings
idx <- dat$Distance == 0.5 & dat$Age == 1
# draw exactly one value per matching row
dat$Value[idx] <- rnorm(sum(idx), 10, 2)
head(dat, 7)
# Distance Age Value
# 1 0.5 1 10.91214
# 2 0.5 2 NA
# 3 0.5 3 NA
# 4 0.5 1 10.84067
# 5 0.5 2 NA
# 6 0.5 3 NA
# 7 0.5 1 11.15517
Say we run a study where participants are measured on some outcome variable four times each. At the start of testing they provide their age and sex. Here is some toy data to illustrate.
set.seed(1)
sex <- NA
age <- NA
df <- data.frame(id = factor(rep(1:4,each=4)),
time = rep(1:4,times=4),
sex = as.vector(sapply(0:3, function(i) sex[i*4 + 1:4] <- c(sample(c("m", "f"), 1, replace = T), rep(NA,3)))),
age = as.vector(sapply(0:3, function(i) age[i*4 + 1:4] <- c(sample(18:75, 1, replace = T), rep(NA,3)))),
outcome = round(rnorm(16),2),
stringsAsFactors = F)
Here is what the data looks like
df
# output
# id time sex age outcome
# 1 1 m 29 0.33
# 1 2 <NA> NA -0.82
# 1 3 <NA> NA 0.49
# 1 4 <NA> NA 0.74
# 2 1 m 70 0.58
# 2 2 <NA> NA -0.31
# 2 3 <NA> NA 1.51
# 2 4 <NA> NA 0.39
# 3 1 f 72 -0.62
# 3 2 <NA> NA -2.21
# 3 3 <NA> NA 1.12
# 3 4 <NA> NA -0.04
# 4 1 f 56 -0.02
# 4 2 <NA> NA 0.94
# 4 3 <NA> NA 0.82
# 4 4 <NA> NA 0.59
Now what I want to do is to use the tidyverse to apply the values for the demographic variables, at present only on the first row of each participant's data, to all the rows.
At present all I could come up with was
df %>% group_by(id) %>% # group by id
distinct(sex) %>% # shrink to unique values for each id
dplyr::filter(!is.na(sex)) %>% # remove the NAs
left_join(df, by = "id")
Which yields the output
# A tibble: 16 x 6
# Groups: id [4]
# sex.x id time sex.y age outcome
# <chr> <fct> <int> <chr> <int> <dbl>
# 1 m 1 1 m 29 0.33
# 2 m 1 2 NA NA -0.82
# 3 m 1 3 NA NA 0.49
# 4 m 1 4 NA NA 0.74
# 5 m 2 1 m 70 0.580
# 6 m 2 2 NA NA -0.31
# 7 m 2 3 NA NA 1.51
# 8 m 2 4 NA NA 0.39
# 9 f 3 1 f 72 -0.62
# 10 f 3 2 NA NA -2.21
# 11 f 3 3 NA NA 1.12
# 12 f 3 4 NA NA -0.04
# 13 f 4 1 f 56 -0.02
# 14 f 4 2 NA NA 0.94
# 15 f 4 3 NA NA 0.82
# 16 f 4 4 NA NA 0.59
Now I would consider this partially successful because the first row in each participant's sex.x column has now been applied to all their other rows, but I really don't like that there are now two sex columns.
Now I could easily add some more functions to the chain that remove the superfluous sex.y column and rename the sex.x column to its original form, but this seems a bit clunky.
Can anyone suggest how to do this better?
You can fill the sex value for each id:
library(dplyr)
df %>% group_by(id) %>% tidyr::fill(sex)
# id time sex age outcome
# <fct> <int> <chr> <int> <dbl>
# 1 1 1 m 51 -1.54
# 2 1 2 m NA -0.93
# 3 1 3 m NA -0.290
# 4 1 4 m NA -0.01
# 5 2 1 f 40 2.4
# 6 2 2 f NA 0.76
# 7 2 3 f NA -0.8
# 8 2 4 f NA -1.15
# 9 3 1 m 60 -0.290
#10 3 2 m NA -0.3
#11 3 3 m NA -0.41
#12 3 4 m NA 0.25
#13 4 1 m 31 -0.89
#14 4 2 m NA 0.44
#15 4 3 m NA -1.24
#16 4 4 m NA -0.22
You could also fill the age value in the same call: df %>% group_by(id) %>% tidyr::fill(sex, age).
PS - I get different numbers from the same seed value though.
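Because the seeded values differ between R versions, here is a small sketch with hand-written data instead of sample(). The data frame toy and its values are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Demographics recorded only on the first row of each participant
toy <- data.frame(
  id   = rep(1:2, each = 3),
  time = rep(1:3, times = 2),
  sex  = c("m", NA, NA, "f", NA, NA),
  age  = c(29, NA, NA, 70, NA, NA)
)

# fill() carries the last non-NA value downwards; grouping keeps it within each id
filled <- toy %>%
  group_by(id) %>%
  fill(sex, age) %>%
  ungroup()

print(filled)
```

Grouping matters here: without group_by(id), a participant whose first row is missing would inherit the previous participant's values.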
I have an unbalanced data frame with dates, localities and prices. I would like to calculate the price differences between localities by date. Because my data is unbalanced, I am thinking of creating extra rows (per locality) to balance it so that I can get all the price differences.
My data look like:
library(dplyr)
set.seed(123)
df= data.frame(date=(1:3),
locality= rbinom(21,3, 0.2),
price=rnorm(21, 50, 20))
df %>%
arrange(date, locality)
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 2 0 26.68910
9 2 1 100.56673
10 2 1 48.88628
11 2 1 48.29153
12 2 2 29.02214
13 2 2 45.68269
14 2 2 43.59887
15 3 0 60.98193
16 3 0 75.89527
17 3 0 43.30174
18 3 0 71.41221
19 3 0 33.62969
20 3 1 34.31236
21 3 1 23.76955
The balanced data I have in mind would look like this:
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 1 2 NA
9 1 2 NA
10 2 0 26.68910
10 2 0 NA
10 2 0 NA
11 2 1 100.56673
12 2 1 48.88628
13 2 1 48.29153
14 2 2 29.02214
15 2 2 45.68269
16 2 2 43.59887
etc...
Finally, to get the price differences between pairs of localities, I have in mind:
> date diff(price, 0-1) diff(price, 0-2) diff(price, 1-2)
1 1 60.07625-54.76426 60.07625-47.09213 etc...
2 1 35.32994-66.51080 35.32994-NA
3 1 63.69872-28.28602 63.69872-NA
You don't need to balance your data. If you use dcast, it will add the NAs for you.
First transform the data to show individual columns for each locality:
library(data.table)
library(tidyverse)
setDT(df)
df[, rid := rowid(date, locality)]
df2 <- dcast(df, rid + date ~ locality, value.var = 'price')
# rid date 0 1 2
# 1: 1 1 60.07625 54.76426 47.09213
# 2: 1 2 26.68910 100.56673 29.02214
# 3: 1 3 60.98193 34.31236 NA
# 4: 2 1 35.32994 66.51080 NA
# 5: 2 2 NA 48.88628 45.68269
# 6: 2 3 75.89527 23.76955 NA
# 7: 3 1 63.69872 28.28602 NA
# 8: 3 2 NA 48.29153 43.59887
# 9: 3 3 43.30174 NA NA
# 10: 4 3 71.41221 NA NA
# 11: 5 3 33.62969 NA NA
Then create a data frame to_diff of differences to calculate, and pmap over that to calculate the differences. Here c0_1 corresponds to what you call in your question diff(price, 0-1).
to_diff <- CJ(0:2, 0:2)[V1 < V2]
pmap(to_diff, ~ df2[[as.character(.x)]] - df2[[as.character(.y)]]) %>%
setNames(paste0('c', to_diff[[1]], '_', to_diff[[2]])) %>%
bind_cols(df2[, 1:2])
# A tibble: 11 x 5
# c0_1 c0_2 c1_2 rid date
# <dbl> <dbl> <dbl> <int> <int>
# 1 5.31 13.0 7.67 1 1
# 2 -73.9 -2.33 71.5 1 2
# 3 26.7 NA NA 1 3
# 4 -31.2 NA NA 2 1
# 5 NA NA 3.20 2 2
# 6 52.1 NA NA 2 3
# 7 35.4 NA NA 3 1
# 8 NA NA 4.69 3 2
# 9 NA NA NA 3 3
# 10 NA NA NA 4 3
# 11 NA NA NA 5 3
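To see why no manual balancing is needed, here is a smaller sketch of the same rowid() plus dcast() idea on a four-row table. The table toy and its prices are made up for illustration; only data.table is assumed:

```r
library(data.table)

# Unbalanced toy data: date 1 has two prices for locality 0 but only one for locality 1
toy <- data.table(
  date     = c(1, 1, 1, 2),
  locality = c(0, 0, 1, 0),
  price    = c(10, 20, 30, 40)
)

# Number the observations within each date/locality pair
toy[, rid := rowid(date, locality)]

# One column per locality; combinations with no observation become NA
wide <- dcast(toy, rid + date ~ locality, value.var = "price")

# Difference between locality 0 and locality 1, NA where either side is missing
wide[, c0_1 := `0` - `1`]
print(wide)
```

The pairing of prices within a date follows row order through rid, so it is only meaningful if that order is meaningful in the original data.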
Let's say I have this kind of data frame:
df <- data.frame(
t=rep(seq(0,2),6),
no=rep(c(1,2,3,4,5,6),each=3),
value=rnorm(18),g=rep(c("nc","c1", NA),each=3)
)
t no value g
1 0 1 0.5022163 nc
2 1 1 0.5687227 nc
3 2 1 -0.2922622 nc
4 0 2 -0.3587089 c1
5 1 2 -0.9028012 c1
6 2 2 0.1926774 c1
7 0 3 0.6771236 NA
8 1 3 0.3752632 NA
9 2 3 0.2795892 NA
10 0 4 -0.4565521 nc
11 1 4 -0.1241807 nc
12 2 4 -1.2603695 nc
13 0 5 -0.6323118 c1
14 1 5 -0.6283850 c1
15 2 5 -0.2052317 c1
16 0 6 1.5996913 NA
17 1 6 -0.4802057 NA
18 2 6 -0.4255056 NA
I want to set the values in df$value to NA whenever there is NA in df$g (only in the same rows).
And similarly, set the values in df$value to NA, if df$no is, e.g., 1 or 5.
I was fooling around with for loops, but I could not get it right.
Any help will be much appreciated.
Thanks
With a for loop:
for (i in 1:nrow(df)) {
  if (df$no[i] == 1 | df$no[i] == 5 | is.na(df$g[i])) {
    df$value[i] <- NA
  }
}
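The same result can be had without a loop by indexing with a logical vector. A minimal sketch that rebuilds the question's data frame (the seed is an addition so the example is reproducible):

```r
set.seed(42)  # seed added only for reproducibility

df <- data.frame(
  t     = rep(seq(0, 2), 6),
  no    = rep(c(1, 2, 3, 4, 5, 6), each = 3),
  value = rnorm(18),
  g     = rep(c("nc", "c1", NA), each = 3)
)

# TRUE for rows where g is missing or no is 1 or 5
drop <- is.na(df$g) | df$no %in% c(1, 5)

# Blank out value in exactly those rows
df$value[drop] <- NA
print(df)
```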
I have four vectors (columns)
x y z t
1 1 1 10
1 1 1 15
1 4 1 14
2 3 1 15
2 2 1 17
2 1 2 19
2 4 2 18
2 4 2 NA
2 2 2 45
3 3 2 NA
3 1 3 59
4 3 3 23
4 4 3 45
4 4 4 74
5 1 4 86
I know how to calculate the mean and median of vector t for each value from x,y, and z.
The example is:
bar <- data.table(expand.grid(x=unique(data[x %in% c(1,2,3,4,5),x]),
y=unique(data[y %in% c(1,2,3,4),y]),
z=unique(data[z %in% c(1,2,3,4),z])))
foo <- data[z %in% c(1,2,3,4),list(
mean.t=mean(t,na.rm=T),
median.t=median(t,na.rm=T))
,by=list(x,y,z)]
merge(bar[,list(x,y,z)],foo,by=c("x","y","z"),all.x=T)
The result is:
x y z mean.t median.t
1: 1 1 1 12.5 12.5
2: 1 1 2 NA NA
3: 1 1 3 NA NA
4: 1 1 4 NA NA
5: 1 2 1 NA NA
........................
79: 5 4 3 NA NA
80: 5 4 4 NA NA
My question now is how to do the same calculations for x, y, z and t, but with z treated not as the numbers 1 to 4 but as groups:
if 0 < z <= 2 group I,
if 2 < z <= 3 group II and
if 3 < z <= 4 group III.
So, the output should be in format like:
x y z mean.t median.t
1: 1 1 I
2: 1 1 II
3: 1 1 III
4: 1 2 I
5: 1 2 II
6: 1 2 III
7: 1 3 I
8: 1 3 II
9: 1 3 III
10: 1 4 I
..........
Define a new column, zGroup, to group by.
(The data in this example is a little different from yours.)
library(data.table)
# create some data
dt <- data.table(x=rep(c(1,2),each=4),
y=rep(c(1,2),each=2,times=2),
z=rep(c(1,2,3,4),times=2),t=1:8)
#add a zGroup column
dt[0<z & z<=2, zGroup:=1]
dt[2<z & z<=3, zGroup:=2]
dt[3<z & z<=4, zGroup:=3]
#group by unique combinations of x, y, zGroup taking mean and median of t
dt[,list(mean.t=mean(t), median.t=as.double(median(t))), by=list(x,y,zGroup)]
Note, this will error without coercing the median to a double. See this post for details.
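As an alternative to three separate assignments, base R's cut() can build the group column in one step and attach the I, II, III labels from the question. A sketch using the same example data as above:

```r
library(data.table)

# Same example data as in the answer above
dt <- data.table(x = rep(c(1, 2), each = 4),
                 y = rep(c(1, 2), each = 2, times = 2),
                 z = rep(c(1, 2, 3, 4), times = 2),
                 t = 1:8)

# cut() uses right-closed intervals by default: (0,2], (2,3], (3,4]
dt[, zGroup := cut(z, breaks = c(0, 2, 3, 4), labels = c("I", "II", "III"))]

# Mean and median of t per x, y, zGroup; as.double() keeps the median type consistent
res <- dt[, .(mean.t = mean(t), median.t = as.double(median(t))),
          by = .(x, y, zGroup)]
print(res)
```

Values of z outside (0, 4] would get NA as their group, which may or may not be what is wanted.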