Reshape data to different columns based on values from a single column - R

Hi, I have a dataframe as below (example):
time <- c("8/11/2017 10:21", "8/10/2017 22:34", "8/16/2017 2:28", "8/14/2017 6:17", "8/11/2017 6:33", "8/15/2017 23:46", "8/10/2017 20:10", "8/14/2017 3:35", "8/11/2017 4:09", "8/15/2017 21:05", "8/11/2017 2:16", "8/10/2017 18:17", "8/13/2017 10:02", "8/13/2017 9:08", "8/13/2017 8:32", "8/13/2017 8:20", "8/13/2017 7:56")
code <- c(1, 3, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2)
var1 <- c(5, 11, 16, 22, 27, 33, 38, 44, 49, 55, 60, 66, 71, 77, 66, 71, 77)
var2 <- c(115, 66, 71, 33, 38, 110, 115, 121, 126, 132, 104, 66, 71, 77, 115, 121, 66)
var3 <- c(38, 44, 49, 55, 60, 66, 71, 77, 66, 71, 77, 132, 104, 66, 71, 77, 115)
time code var1 var2 var3
8/11/2017 10:21 1 5 115 38
8/10/2017 22:34 3 11 66 44
8/16/2017 2:28 2 16 71 49
8/14/2017 6:17 2 22 33 55
8/11/2017 6:33 1 27 38 60
8/15/2017 23:46 2 33 110 66
8/10/2017 20:10 2 38 115 71
8/14/2017 3:35 1 44 121 77
8/11/2017 4:09 1 49 126 66
8/15/2017 21:05 2 55 132 71
8/11/2017 2:16 2 60 104 77
8/10/2017 18:17 2 66 66 132
8/13/2017 10:02 2 71 71 104
8/13/2017 9:08 2 77 77 66
8/13/2017 8:32 1 66 115 71
8/13/2017 8:20 1 71 121 77
8/13/2017 7:56 2 77 66 115
I want to recast this dataframe using the column "code". The output I'm expecting should be as below.
time code1_var1 code1_var2 code1_var3 code2_var1 code2_var2 code2_var3 code3_var1 code3_var2 code3_var3
8/11/2017 10:21
8/10/2017 22:34
8/16/2017 2:28
8/14/2017 6:17
8/11/2017 6:33
8/15/2017 23:46
8/10/2017 20:10
8/14/2017 3:35
8/11/2017 4:09
8/15/2017 21:05
8/11/2017 2:16
8/10/2017 18:17
8/13/2017 10:02
8/13/2017 9:08
8/13/2017 8:32
8/13/2017 8:20
8/13/2017 7:56
But when I tried the dcast function in R, it gave me an error for the time variable.
Please help me with this reshaping objective.
Note: the result should have many NAs because of the reshaping and missing data.

The easiest way to do this would be with dcast from "data.table" or even reshape from base R.
Assuming your vectors are collected in a data.frame named "d", try the following:
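For completeness, a minimal sketch of collecting the question's vectors into d (keeping time as character for now):
d <- data.frame(time, code, var1, var2, var3, stringsAsFactors = FALSE)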
library(data.table)
setDT(d)
x <- dcast(d, time ~ code, value.var = paste0("var", 1:3))
head(x)
# time var1_1 var1_2 var1_3 var2_1 var2_2 var2_3 var3_1 var3_2 var3_3
# 1: 8/10/2017 18:17 NA 66 NA NA 66 NA NA 132 NA
# 2: 8/10/2017 20:10 NA 38 NA NA 115 NA NA 71 NA
# 3: 8/10/2017 22:34 NA NA 11 NA NA 66 NA NA 44
# 4: 8/11/2017 10:21 5 NA NA 115 NA NA 38 NA NA
# 5: 8/11/2017 2:16 NA 60 NA NA 104 NA NA 77 NA
# 6: 8/11/2017 4:09 49 NA NA 126 NA NA 66 NA NA
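Note that time sorts lexicographically in that output ("8/11/2017 10:21" lands before "8/11/2017 2:16"). If chronological order matters, parse the column first; a sketch assuming a month/day/year format:
d[, time := as.POSIXct(time, format = "%m/%d/%Y %H:%M")]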
Or, with base R:
reshape(d, direction = "wide", idvar = "time", timevar = "code")
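A hedged note on the base R route: reshape() joins the value and code into the new names with "." by default (var1.1, var2.1, ..., rather than code1_var1), and it is safest to hand it a plain data.frame after the setDT() above:
wide <- reshape(as.data.frame(d), direction = "wide", idvar = "time", timevar = "code")
names(wide)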
If you wanted to use the tidyverse, you would first gather, then unite the code and variable names into a single key, and then spread to the wide format:
library(tidyverse)
d %>%
  gather(variable, value, starts_with("var")) %>%
  unite(key, code, variable) %>%
  spread(key, value)
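With tidyr 1.0+, gather() and spread() are superseded; a pivot_longer()/pivot_wider() sketch that folds the unite step into the widening (multiple names_from columns are joined with "_" by default):
d %>%
  pivot_longer(starts_with("var"), names_to = "variable") %>%
  pivot_wider(names_from = c(code, variable), values_from = value)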

Related

Sample part of a dataset while keeping subgroups intact

I have a dataframe that I would like to split into a 75% part and a 25% part.
I thought a good first step would be to create the 25% dataset from the original, by randomly sampling a quarter of the data.
However, the sampling shouldn't be entirely random: I want to preserve the groups of a certain variable.
So with the example below, I want to randomly sample 1/4 of the data frame, but the data needs to remain grouped by the 'team' variable. I have 8 teams, so I want to randomly sample 2 teams.
Data example (dput below)
team points assists
1 1 99 33
2 1 90 28
3 1 86 31
4 1 88 39
5 2 95 34
6 2 92 30
7 2 91 32
8 2 79 35
9 3 85 36
10 3 90 29
11 3 91 24
12 3 97 26
13 4 96 28
14 4 94 18
15 4 95 19
16 4 98 25
17 5 78 36
18 5 80 34
19 5 85 39
20 5 89 33
21 6 94 34
22 6 85 39
23 6 99 28
24 6 79 31
25 7 78 35
26 7 99 29
27 7 98 36
28 7 75 39
29 8 97 33
30 8 68 26
31 8 86 38
32 8 76 31
I've tried this using slice_sample from dplyr, but it does the exact opposite of what I want (it splits every team):
testdata <- df %>% group_by(team) %>% slice_sample(n = 2)
My code results in
team points assists
<dbl> <dbl> <dbl>
1 1 90 28
2 1 99 33
3 2 95 34
4 2 92 30
5 3 91 24
6 3 85 36
7 4 95 19
8 4 98 25
9 5 80 34
10 5 78 36
11 6 85 39
12 6 94 34
13 7 78 35
14 7 98 36
15 8 76 31
16 8 86 38
Example of the dataframe:
structure(list(team = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), points = c(99,
90, 86, 88, 95, 92, 91, 79, 85, 90, 91, 97, 96, 94, 95, 98, 78,
80, 85, 89, 94, 85, 99, 79, 78, 99, 98, 75, 97, 68, 86, 76),
assists = c(33, 28, 31, 39, 34, 30, 32, 35, 36, 29, 24, 26,
28, 18, 19, 25, 36, 34, 39, 33, 34, 39, 28, 31, 35, 29, 36,
39, 33, 26, 38, 31)), class = "data.frame", row.names = c(NA,
-32L))
With dplyr, if you group_by(team) and then sample, you are sampling within each team, which is the opposite of what you want. Instead, sample the team IDs directly:
test_teams <- sample(unique(dataset$team), size = 2)
test <- dataset %>% filter(team %in% test_teams)
train <- dataset %>% filter(!team %in% test_teams)
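To generalize beyond a hard-coded 2, you can derive the number of test teams from the desired fraction; a small sketch (the 0.25 reflects the question's one-quarter target):
n_test <- ceiling(0.25 * length(unique(dataset$team)))
test_teams <- sample(unique(dataset$team), size = n_test)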
Another option is sample.split from caTools, though note that it stratifies by the team labels (splitting within each team at the given ratio) rather than keeping whole teams together:
library(caTools)
split <- sample.split(dataset$team, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

How to create weekly results from daily data using R?

I'm trying to create a weekly cumulative result from the daily data below:
date A B C D E F G H
16-Jan-22 227 441 3593 2467 9 6 31 2196
17-Jan-22 224 353 3555 2162 31 5 39 2388
18-Jan-22 181 144 2734 2916 0 0 14 1753
19-Jan-22 95 433 3610 3084 42 19 10 2862
20-Jan-22 141 222 3693 3149 183 19 23 2176
21-Jan-22 247 426 3455 4016 68 0 1 2759
22-Jan-22 413 931 4435 4922 184 2 39 3993
23-Jan-22 389 1340 5433 5071 200 48 27 4495
24-Jan-22 281 940 6875 5009 343 47 71 3713
25-Jan-22 314 454 5167 4555 127 1 68 3554
26-Jan-22 315 973 5789 3809 203 1 105 4456
27-Jan-22 269 1217 6776 4578 227 91 17 5373
28-Jan-22 248 1320 5942 3569 271 91 156 4260
29-Jan-22 155 1406 6771 4328 426 44 109 4566
A solution using data.table and lubridate:
library(lubridate)
library(data.table)
setDT(df)
df[, lapply(.SD, sum), by = isoweek(dmy(date))]
# isoweek A B C D E F G H
# 1: 2 227 441 3593 2467 9 6 31 2196
# 2: 3 1690 3849 26915 25320 708 93 153 20426
# 3: 4 1582 6310 37320 25848 1597 275 526 25922
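If you would rather label each group by the date its week begins than by a bare ISO week number, a sketch using lubridate's floor_date() (week_start = 1 makes weeks begin on Monday):
df[, lapply(.SD, sum), by = .(week_start = floor_date(dmy(date), "week", week_start = 1))]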
I wanted to provide a solution using tidyverse principles.
You will need the group_by() and summarize() verbs plus the very useful across() function.
# recreate the data in a tribble
library(tidyverse)
library(lubridate)
df <- tribble(
~date, ~A, ~B, ~C, ~D, ~E, ~F, ~G, ~H,
"16-Jan-22",227, 441, 3593, 2467, 9, 6, 31, 2196,
"17-Jan-22",224, 353, 3555, 2162, 31, 5, 39, 2388,
"18-Jan-22",181, 144, 2734, 2916, 0, 0, 14, 1753,
"19-Jan-22",95, 433, 3610, 3084, 42, 19, 10, 2862,
"20-Jan-22",141, 222, 3693, 3149, 183, 19, 23, 2176,
"21-Jan-22",247, 426, 3455, 4016, 68, 0, 1, 2759,
"22-Jan-22",413, 931, 4435, 4922, 184, 2, 39, 3993,
"23-Jan-22",389, 1340, 5433, 5071, 200, 48, 27, 4495,
"24-Jan-22",281, 940, 6875, 5009, 343, 47, 71, 3713,
"25-Jan-22",314, 454, 5167, 4555, 127, 1, 68, 3554,
"26-Jan-22",315, 973, 5789, 3809, 203, 1, 105, 4456,
"27-Jan-22",269, 1217, 6776, 4578, 227, 91, 17, 5373,
"28-Jan-22",248, 1320, 5942, 3569, 271, 91, 156, 4260,
"29-Jan-22",155, 1406, 6771, 4328, 426, 44, 109, 4566)
# parse the day-month-year strings into Date objects (the source format is dmy, not ymd)
df$date <- dmy(df$date)
# create a new column "week_num" and group by it,
# then summarize every column except "date" with its sum
df %>%
  group_by(week_num = lubridate::isoweek(date)) %>%
  summarize(across(-c("date"), sum))
# which gives (matching the data.table answer above)
  week_num     A     B     C     D     E     F     G     H
     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1        2   227   441  3593  2467     9     6    31  2196
2        3  1690  3849 26915 25320   708    93   153 20426
3        4  1582  6310 37320 25848  1597   275   526 25922
group_by() and summarize() are relatively straightforward. across() is a fairly new verb that is super powerful: it lets you reference columns using tidy-selection principles (e.g. starts_with(), c(1:9), etc.), and if you give it a function it applies that function to each of the selected columns. Less typing!
Alternatively, you would have to sum each column individually (A = sum(A), B = sum(B), and so on), which is much more typing.
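To make the tidy-selection point concrete, a small illustration (not part of the original answer) that sums only columns A through D:
df %>%
  group_by(week_num = lubridate::isoweek(date)) %>%
  summarize(across(A:D, sum))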

How do I create a loop and/or a function to divide 200 columns (and create 200 new columns/variables) by another column to get a percentage?

How do I create a loop and/or a function to divide 200 columns by another column (creating 200 new columns) to get percentages?
How do I do this in a loop so I can handle 200 columns, and how do I name the new columns so that each one is the old column name with "p_" in front of it?
Is this possible?
For example, I'm trying to do something like this but with 200 columns.
fans <- data.frame(
population = c(1234, 5678, 2345, 6789, 3456, 7890,
4567, 8901, 5678, 9012, 6789),
bearsfans = c(123, 234, 345, 456, 567,678, 789, 890, 901, 135, 246),
packersfans = c(11,22,33,44,55,66,77,88,99,100,122),
vikingsfans = c(39, 49, 59, 61, 32, 22, 31, 92, 52, 10, 122))
print(fans)
attach(fans)
## create new columns which are the ratio of fans to population
fans$p_bearsfan = bearsfans/population
print(fans)
Output:
## population bearsfans packersfans vikingsfans p_bearsfan
## 1 1234 123 11 39 0.09967585
## 2 5678 234 22 49 0.04121169
One base R approach: divide all the fan columns by population in a single sapply() call, rename the result, then bind it back on:
temp <- sapply(fans[-1], function(x) x / fans$population)
colnames(temp) <- paste0("p_", colnames(temp))
cbind(fans, temp)
population bearsfans packersfans vikingsfans p_bearsfans p_packersfans p_vikingsfans
1 1234 123 11 39 0.09967585 0.008914100 0.031604538
2 5678 234 22 49 0.04121169 0.003874604 0.008629799
3 2345 345 33 59 0.14712154 0.014072495 0.025159915
4 6789 456 44 61 0.06716748 0.006481072 0.008985123
5 3456 567 55 32 0.16406250 0.015914352 0.009259259
6 7890 678 66 22 0.08593156 0.008365019 0.002788340
7 4567 789 77 31 0.17276111 0.016860083 0.006787826
8 8901 890 88 92 0.09998877 0.009886530 0.010335917
9 5678 901 99 52 0.15868263 0.017435717 0.009158154
10 9012 135 100 10 0.01498003 0.011096316 0.001109632
11 6789 246 122 122 0.03623509 0.017970246 0.017970246
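A brief aside on why this works: sapply() simplifies its result to a numeric matrix here, which is what lets the colnames<- and cbind() steps line up; you can confirm with str():
str(temp)
# num [1:11, 1:3] 0.0997 0.0412 0.1471 0.0672 0.1641 ...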
If you're happy with a suffixed new column name (instead of prefixing), this is a one-liner using dplyr::mutate_at. I assume here that all relevant columns end with the word "fans".
With suffixes
fans %>% mutate_at(vars(ends_with("fans")), list(percent = ~.x / population))
# population bearsfans packersfans vikingsfans bearsfans_percent
#1 1234 123 11 39 0.09967585
#2 5678 234 22 49 0.04121169
#3 2345 345 33 59 0.14712154
#4 6789 456 44 61 0.06716748
#5 3456 567 55 32 0.16406250
#6 7890 678 66 22 0.08593156
#7 4567 789 77 31 0.17276111
#8 8901 890 88 92 0.09998877
#9 5678 901 99 52 0.15868263
#10 9012 135 100 10 0.01498003
#11 6789 246 122 122 0.03623509
# packersfans_percent vikingsfans_percent
#1 0.008914100 0.031604538
#2 0.003874604 0.008629799
#3 0.014072495 0.025159915
#4 0.006481072 0.008985123
#5 0.015914352 0.009259259
#6 0.008365019 0.002788340
#7 0.016860083 0.006787826
#8 0.009886530 0.010335917
#9 0.017435717 0.009158154
#10 0.011096316 0.001109632
#11 0.017970246 0.017970246
With prefixes
Turning the suffixes into prefixes requires one more step:
fans %>%
  mutate_at(vars(ends_with("fans")), list(percent = ~.x / population)) %>%
  rename_at(vars(ends_with("percent")), ~sub("(.+)_percent", "p_\\1", .x))
# population bearsfans packersfans vikingsfans p_bearsfans p_packersfans
#1 1234 123 11 39 0.09967585 0.008914100
#2 5678 234 22 49 0.04121169 0.003874604
#3 2345 345 33 59 0.14712154 0.014072495
#4 6789 456 44 61 0.06716748 0.006481072
#5 3456 567 55 32 0.16406250 0.015914352
#6 7890 678 66 22 0.08593156 0.008365019
#7 4567 789 77 31 0.17276111 0.016860083
#8 8901 890 88 92 0.09998877 0.009886530
#9 5678 901 99 52 0.15868263 0.017435717
#10 9012 135 100 10 0.01498003 0.011096316
#11 6789 246 122 122 0.03623509 0.017970246
# p_vikingsfans
#1 0.031604538
#2 0.008629799
#3 0.025159915
#4 0.008985123
#5 0.009259259
#6 0.002788340
#7 0.006787826
#8 0.010335917
#9 0.009158154
#10 0.001109632
#11 0.017970246
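Worth noting: mutate_at() and rename_at() are superseded in dplyr 1.0+. A sketch of the modern equivalent, where across() takes a .names glue pattern so the "p_" prefix happens in one step:
fans %>% mutate(across(ends_with("fans"), ~ .x / population, .names = "p_{.col}"))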
We can directly divide multiple columns by one column. We use grep to select the column names ending in "fans" and use those names to assign the new columns.
cols <- grep("fans$", names(fans), value = TRUE)
fans[paste0("p_", cols)] <- fans[cols]/fans$population
fans
# population bearsfans packersfans vikingsfans p_bearsfans p_packersfans p_vikingsfans
#1 1234 123 11 39 0.09968 0.008914 0.031605
#2 5678 234 22 49 0.04121 0.003875 0.008630
#3 2345 345 33 59 0.14712 0.014072 0.025160
#4 6789 456 44 61 0.06717 0.006481 0.008985
#5 3456 567 55 32 0.16406 0.015914 0.009259
#6 7890 678 66 22 0.08593 0.008365 0.002788
#7 4567 789 77 31 0.17276 0.016860 0.006788
#8 8901 890 88 92 0.09999 0.009887 0.010336
#9 5678 901 99 52 0.15868 0.017436 0.009158
#10 9012 135 100 10 0.01498 0.011096 0.001110
#11 6789 246 122 122 0.03624 0.017970 0.017970
Also, as a side note: Why is it not advisable to use attach() in R, and what should I use instead?
Data
fans <- data.frame(
population = c(1234, 5678, 2345, 6789, 3456, 7890,
4567, 8901, 5678, 9012, 6789),
bearsfans = c(123, 234, 345, 456, 567,678, 789, 890, 901, 135, 246),
packersfans = c(11,22,33,44,55,66,77,88,99,100,122),
vikingsfans = c(39, 49, 59, 61, 32, 22, 31, 92, 52, 10, 122))
Naive Solution
for (col in names(fans)[-1]) {
  fans[[paste0("p_", col)]] <- fans[[col]] / fans[["population"]]
}
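Since the question asks for "a loop and/or a function", here is a minimal sketch wrapping that loop in a function; the name add_percent_cols is mine, not from the original answer:
add_percent_cols <- function(df, denom = "population") {
  # divide every column except the denominator by it, prefixing new names with "p_"
  for (col in setdiff(names(df), denom)) {
    df[[paste0("p_", col)]] <- df[[col]] / df[[denom]]
  }
  df
}
fans <- add_percent_cols(fans)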

R Impute NA's by Linear Increase Depending on Time Interval

PROBLEM
I need to impute the NA's in my data frame, which comes from a repeated-measures study. For this particular outcome, each NA should be imputed with the last observed non-NA value, plus 1 for each full 52-week interval elapsed since the week of the last observed value.
EXAMPLE
An example data frame with the target imputation goal included.
df <- data.frame(
subject = rep(1:3, each = 12),
week = rep(c(8, 10, 12, 16, 20, 26, 32, 44, 52, 64, 78, 104),3),
value = c(112, 97, 130, 104, NA, NA, NA, NA, NA, NA, NA, NA,
89, 86, 94, 96, 88,107, 110, 102, 107, NA, NA, NA,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, NA, NA),
goal = c(112, 97, 130, 104, 104, 104, 104, 104, 104, 104, 105, 105,
89, 86, 94, 96, 88,107, 110, 102, 107, 107,107, 108,
107, 110, 102, 130, 104, 88, 82, 79, 92, 106, 106, 106)
)
I left the intermediate columns in to make what's happening more obvious, but you can remove them with a simple select.
library(dplyr)
df <- df %>%
  group_by(subject) %>%
  mutate(last_obs_week = max(week[!is.na(value)]),
         since_last_week = pmax(0, week - last_obs_week),
         inc_52 = since_last_week %/% 52,
         result = zoo::na.locf(value) + inc_52)
all(df$goal == df$result)
# [1] TRUE
print.data.frame(df)
# subject week value goal last_obs_week since_last_week inc_52 result
# 1 1 8 112 112 16 0 0 112
# 2 1 10 97 97 16 0 0 97
# 3 1 12 130 130 16 0 0 130
# 4 1 16 104 104 16 0 0 104
# 5 1 20 NA 104 16 4 0 104
# 6 1 26 NA 104 16 10 0 104
# 7 1 32 NA 104 16 16 0 104
# 8 1 44 NA 104 16 28 0 104
# 9 1 52 NA 104 16 36 0 104
# 10 1 64 NA 104 16 48 0 104
# 11 1 78 NA 105 16 62 1 105
# 12 1 104 NA 105 16 88 1 105
# 13 2 8 89 89 52 0 0 89
# ...
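And the "simple select" mentioned above, to drop the helper columns once you've verified the result (a sketch):
df %>% select(-last_obs_week, -since_last_week, -inc_52)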
One can use dplyr and tidyr::fill to get the desired result. The logic is to add a column tracking the week of the last non-NA value, use tidyr::fill to carry that week (and the value itself) forward, and then increase the carried value by 1 for every full 52 weeks between the current week and the last non-NA week.
library(dplyr)
library(tidyr)
df %>%
  group_by(subject) %>%
  mutate(weekWithLastNonNaValue = ifelse(is.na(value), NA, week)) %>%
  fill(value, weekWithLastNonNaValue) %>%
  mutate(value = value + (week - weekWithLastNonNaValue) %/% 52) %>%
  select(-weekWithLastNonNaValue) %>%
  as.data.frame()
# subject week value goal
# 1 1 8 112 112
# 2 1 10 97 97
# 3 1 12 130 130
# 4 1 16 104 104
# 5 1 20 104 104
# 6 1 26 104 104
# 7 1 32 104 104
# 8 1 44 104 104
# 9 1 52 104 104
# 10 1 64 104 104
# 11 1 78 105 105
# 12 1 104 105 105
# 13 2 8 89 89
# 14 2 10 86 86
# 15 2 12 94 94
# 16 2 16 96 96
# 17 2 20 88 88
# 18 2 26 107 107
# 19 2 32 110 110
# 20 2 44 102 102
# ... and so on

Creating a subset in R using a double loop (continuation)

DF:
Year  species  lat    long     Julian_Day  Land
1901  1        49.00  -79.220  79          16
1901  1        46.00  -79.500  125         24
1903  2        47.00  -78.220  165         25
1968  65       47.00  -79.840  178         30
1978  1        48.00  -78.110  193         34
2002  82       43.10  -77.114  68          34
2006  3        44.23  -76.330  90          39
2010  1        47.11  -76.200  230         41
There are more variables, but that's an example of the matrix. I only want to keep, for each year and each species, the row with the lowest value of Julian_Day. I.e., the second row above would be omitted, because 79 is less than 125 for species 1 in 1901.
First of all, I would suggest providing your data.frame in a format that is easy for people to use; we'll be able to help you better and faster.
df <- structure(list(Year = c(1901, 1901, 1903, 1968, 1978, 2002, 2006, 2010),
  species = c(1, 1, 2, 65, 1, 82, 3, 1),
  lat = c(49, 46, 47, 47, 48, 43.1, 44.23, 47.11),
  long = c(-79.22, -79.5, -78.22, -79.84, -78.11, -77.114, -76.33, -76.2),
  Julian_Day = c(79, 125, 165, 178, 193, 68, 90, 230),
  Land = c(16, 24, 25, 30, 34, 34, 39, 41)),
  .Names = c("Year", "species", "lat", "long", "Julian_Day", "Land"),
  row.names = c(NA, -8L), class = "data.frame")
Here is your data.frame
df
# Year species lat long Julian_Day Land
#1: 1901 1 49.00 -79.220 79 16
#2: 1901 1 46.00 -79.500 125 24
#3: 1903 2 47.00 -78.220 165 25
#4: 1968 65 47.00 -79.840 178 30
#5: 1978 1 48.00 -78.110 193 34
#6: 2002 82 43.10 -77.114 68 34
#7: 2006 3 44.23 -76.330 90 39
#8: 2010 1 47.11 -76.200 230 41
Generally, you just have to do dput(head(your_dataframe)). But you can build a small fake data frame to illustrate your point if you cannot reveal your data.
Here's a possible solution using the data.table package:
library(data.table)
setDT(df)[ ,.SD[which.min(Julian_Day)], .(species, Year)]
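For comparison, a dplyr sketch of the same idea (slice_min() requires dplyr 1.0+; unlike which.min(), it keeps ties by default):
library(dplyr)
df %>%
  group_by(species, Year) %>%
  slice_min(Julian_Day, n = 1) %>%
  ungroup()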
