Mutate a part of my variables to a unique column [duplicate] - r

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 5 years ago.
I am coding in R.
I have a table like :
region;2012;2013;2014;2015
1;2465;245;2158;645
2;44;57;687;564
3;545;784;897;512
...
And I want to transform it into :
region;value;annee
1;2465;2012
1;245;2013
1;2158;2014
1;645;2015
2;44;2012
...
Do you know how I can do it ?

First, read the data:
dat <- read.csv2(text = "region;2012;2013;2014;2015
1;2465;245;2158;645
2;44;57;687;564
3;545;784;897;512",
check.names = FALSE)
The data frame can be converted into the long format with gather from package tidyr.
library(tidyr)
dat_long <- gather(dat, key = "annee", value = "value", -region)
The result:
region annee value
1 1 2012 2465
2 2 2012 44
3 3 2012 545
4 1 2013 245
5 2 2013 57
6 3 2013 784
7 1 2014 2158
8 2 2014 687
9 3 2014 897
10 1 2015 645
11 2 2015 564
12 3 2015 512
You can also produce the ;-separated result of your question:
write.csv2(dat_long, "", row.names = FALSE, quote = FALSE)
This results in:
region;annee;value
1;2012;2465
2;2012;44
3;2012;545
1;2013;245
2;2013;57
3;2013;784
1;2014;2158
2;2014;687
3;2014;897
1;2015;645
2;2015;564
3;2015;512
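If you would rather not depend on tidyr, here is a minimal base R sketch that gives the row order and column order asked for in the question (region;value;annee). The names year_cols and dat_long are only illustrative, and it assumes every column other than region is a year. Note also that in current tidyr versions gather() is superseded by pivot_longer().

```r
# Read the sample data from the question (semicolon-separated)
dat <- read.csv2(text = "region;2012;2013;2014;2015
1;2465;245;2158;645
2;44;57;687;564
3;545;784;897;512",
                 check.names = FALSE)

# Assumption: every column except 'region' holds one year of values
year_cols <- setdiff(names(dat), "region")

# t() puts one region per column, so as.vector() walks region by region
dat_long <- data.frame(
  region = rep(dat$region, each = length(year_cols)),
  value  = as.vector(t(dat[year_cols])),
  annee  = rep(as.integer(year_cols), times = nrow(dat))
)

dat_long
```

Passing dat_long to write.csv2 as shown above then writes it back out in the ;-separated form.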

An example to answer the question :
olddata_wide
#> subject sex control cond1 cond2
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
library(tidyr)
# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)
data_long <- gather(olddata_wide, condition, measurement, control:cond2, factor_key=TRUE)
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 2 F control 6.3
#> 3 3 F control 9.5
#> 4 4 M control 11.5
#> 5 1 M cond1 12.3
#> 6 2 F cond1 10.6
#> 7 3 F cond1 13.1
#> 8 4 M cond1 13.4
#> 9 1 M cond2 10.7
#> 10 2 F cond2 11.1
#> 11 3 F cond2 13.8
#> 12 4 M cond2 12.9
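The answer above does not show how olddata_wide was built, so here is a self-contained sketch that recreates it from the printed values and does the same wide-to-long step with base R's reshape() instead of gather(). This is an alternative, not the method used above; cond_cols is an illustrative name.

```r
# Recreate the example data as printed above
olddata_wide <- data.frame(
  subject = 1:4,
  sex     = c("M", "F", "F", "M"),
  control = c(7.9, 6.3, 9.5, 11.5),
  cond1   = c(12.3, 10.6, 13.1, 13.4),
  cond2   = c(10.7, 11.1, 13.8, 12.9)
)

# Columns that hold the measurements
cond_cols <- c("control", "cond1", "cond2")

# Stack the measurement columns into one column, labelled by condition
data_long <- reshape(
  olddata_wide,
  direction = "long",
  varying   = cond_cols,
  v.names   = "measurement",
  timevar   = "condition",
  times     = cond_cols,
  idvar     = "subject"
)
rownames(data_long) <- NULL

data_long
```

Unlike factor_key=TRUE in gather(), condition comes back as a character vector here; wrap it in factor(..., levels = cond_cols) if the ordering matters.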

Add character to specific value in rows by condition

Let's say we have the following data:
df1 = data.frame(cm= c('10129', '21120', '123456','345'),
num=c(6,6,6,4))
> df1
cm num
10129 6
21120 6
123456 6
345 4
As you can see, some values in the cm column have fewer digits than the number given in the num column. I want to code the following: if the number of digits in cm is less than the number in num, add a 0 at the front to get this output:
cm num
010129 6
021120 6
123456 6
0345 4
You can use str_pad
library(tidyverse)
df1 %>% mutate(cm = str_pad(cm, num, "left", "0"))
#> cm num
#> 1 010129 6
#> 2 021120 6
#> 3 123456 6
#> 4 0345 4
Created on 2022-04-13 by the reprex package (v2.0.1)
Input Data
df1 <- data.frame(cm = c('10129', '21120', '123456','345'), num = c(6,6,6,4))
df1
#> cm num
#> 1 10129 6
#> 2 21120 6
#> 3 123456 6
#> 4 345 4
Perhaps simplest using dplyr and nchar:
library(dplyr)
df1 %>% mutate(cm = if_else(nchar(cm) < num, paste0(0, cm), cm))
cm num
1 010129 6
2 021120 6
3 123456 6
4 345239 6
The other tidyverse/dplyr answers are nicer, but if you want to stick to base R for some reason:
df1$cm <- ifelse(nchar(df1$cm) < df1$num, paste0('0', df1$cm), df1$cm)
df1
#> cm num
#> 1 010129 6
#> 2 021120 6
#> 3 123456 6
#> 4 345239 6
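One caveat with the two paste0 versions: they only ever add a single zero. If a value can be more than one digit short, a base R sketch along these lines pads all the way up to num, as str_pad does. It uses strrep() from base R; the column name cm_padded is just for illustration.

```r
# Input data as given in the str_pad answer
df1 <- data.frame(cm = c("10129", "21120", "123456", "345"),
                  num = c(6, 6, 6, 4))

# Number of zeros each value is missing (never negative)
n_missing <- pmax(df1$num - nchar(df1$cm), 0)

# Prepend that many zeros to each value
df1$cm_padded <- paste0(strrep("0", n_missing), df1$cm)

df1
```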

Complex aggregate function construction in R? [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 2 years ago.
Probably this is not that complex, but I couldn't figure out how to write a concise title explaining it:
I'm trying to use the aggregate function in R to return (1) the lowest value of a given column (val) by category (cat.2) in a data frame and (2) the value of another column (cat.1) on the same row. I know how to do part #1, but I can't figure out part #2.
The data:
cat.1<-c(1,2,3,4,5,1,2,3,4,5)
cat.2<-c(1,1,1,2,2,2,2,3,3,3)
val<-c(10.1,10.2,9.8,9.7,10.5,11.1,12.5,13.7,9.8,8.9)
df<-data.frame(cat.1,cat.2,val)
> df
cat.1 cat.2 val
1 1 1 10.1
2 2 1 10.2
3 3 1 9.8
4 4 2 9.7
5 5 2 10.5
6 1 2 11.1
7 2 2 12.5
8 3 3 13.7
9 4 3 9.8
10 5 3 8.9
I know how to use aggregate to return the minimum value for each cat.2:
> aggregate(df$val, by=list(df$cat.2), FUN=min)
Group.1 x
1 1 9.8
2 2 9.7
3 3 8.9
The second part of it, which I can't figure out, is to return the value in cat.1 on the same row of df where aggregate found min(df$val) for each cat.2. Not sure I'm explaining it well, but this is the intended result:
> ...
Group.1 x cat.1
1 1 9.8 3
2 2 9.7 4
3 3 8.9 5
Any help much appreciated.
If we need the output after the aggregate, we can merge it with the original dataset:
merge(aggregate(df$val, by=list(df$cat.2), FUN=min),
df, by.x = c('Group.1', 'x'), by.y = c('cat.2', 'val'))
# Group.1 x cat.1
#1 1 9.8 3
#2 2 9.7 4
#3 3 8.9 5
But this can be done more easily with dplyr: group by 'cat.2', then use slice to keep the row with the minimum 'val' in each group.
library(dplyr)
df %>%
group_by(cat.2) %>%
slice(which.min(val))
# A tibble: 3 x 3
# Groups: cat.2 [3]
# cat.1 cat.2 val
# <dbl> <dbl> <dbl>
#1 3 1 9.8
#2 4 2 9.7
#3 5 3 8.9
Or with data.table
library(data.table)
setDT(df)[, .SD[which.min(val)], cat.2]
Or in base R, this can be done with ave
df[with(df, val == ave(val, cat.2, FUN = min)),]
# cat.1 cat.2 val
#3 3 1 9.8
#4 4 2 9.7
#10 5 3 8.9
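Note that the ave version keeps every row that ties for the group minimum. If you need exactly one row per group in base R, a possible sketch is to sort by group and value and keep the first row of each group. The names df_sorted and min_rows are illustrative.

```r
# Data from the question
df <- data.frame(
  cat.1 = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  cat.2 = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  val   = c(10.1, 10.2, 9.8, 9.7, 10.5, 11.1, 12.5, 13.7, 9.8, 8.9)
)

# Sort so that the smallest val comes first within each cat.2
df_sorted <- df[order(df$cat.2, df$val), ]

# Keep only the first (smallest) row of each group
min_rows <- df_sorted[!duplicated(df_sorted$cat.2), ]

min_rows
```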

find first occurrence in two variables in df

I need to find the first two times my df meets a certain condition grouped by two variables. I am trying to use the ddply function, but I am doing something wrong with the ".variables" command.
So in this example, I'm trying to find the first two times x > 30 and y > 30 in each group / trial.
The way I'm using ddply is giving me the first two times in the dataset, then repeating that for every group.
set.seed(1)
df <- data.frame((matrix(nrow=200,ncol=5)))
colnames(df) <- c("group","trial","x","y","hour")
df$group <- rep(c("A","B","C","D"),each=50)
df$trial <- rep(c(rep(1,times=25),rep(2,times=25)),times=4)
df[,3:4] <- runif(400,0,50)
df$hour <- rep(1:25,time=8)
library(plyr)
ddply(.data=df, .variables=c("group","trial"), .fun=function(x) {
i <- which(df$x > 30 & df$y >30 )[1:2]
if (!is.na(i)) x[i, ]
})
Expected results:
group trial x y hour
13 A 1 34.3511423 38.161134 13
15 A 1 38.4920710 40.931734 15
36 A 2 33.4233369 34.481392 11
37 A 2 39.7119930 34.470671 12
52 B 1 43.0604738 46.645491 2
65 B 1 32.5435234 35.123126 15
But instead, my code is finding c(1,4) from the first group*trial and repeating that for every group*trial:
group trial x y hour
1 A 1 34.351142 38.161134 13
2 A 1 38.492071 40.931734 15
3 A 2 5.397181 27.745031 13
4 A 2 20.563721 22.636003 15
5 B 1 22.953286 13.898301 13
6 B 1 32.543523 35.123126 15
I would also like for there to be rows of NA if a second occurrence isn't present in a group*trial.
Thanks,
I think this is what you want:
library(tidyverse)
df %>% group_by(group, trial) %>% filter(x > 30 & y > 30) %>% slice(1:2)
Result:
# A tibble: 16 x 5
# Groups: group, trial [8]
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 33.5 46.3 4
2 A 1 32.6 42.7 11
3 A 2 35.9 43.6 4
4 A 2 30.5 42.7 14
5 B 1 33.0 38.1 2
6 B 1 40.5 30.4 7
7 B 2 48.6 33.2 2
8 B 2 34.1 30.9 4
9 C 1 33.0 45.1 1
10 C 1 30.3 36.7 17
11 C 2 44.8 33.9 1
12 C 2 41.5 35.6 6
13 D 1 44.2 34.3 12
14 D 1 39.1 40.0 23
15 D 2 39.4 47.5 4
16 D 2 42.1 40.1 10
(slightly different from your results, probably a different R version)
I recommend using dplyr or data.table rather than plyr. From the plyr github page:
plyr is retired: this means only changes necessary to keep it on CRAN
will be made. We recommend using dplyr (for data frames) or purrr (for
lists) instead.
Since someone has already provided a solution with dplyr, here is one option with data.table.
In the selection df[i, j, k] I am selecting rows which match your criteria in i, grouping by the given variables in k, and selecting the first two rows (head) of each group-specific subset of the data .SD. All of this inside the brackets is data.table specific, and only works because I converted df to a data.table first with setDT.
library(data.table)
setDT(df)
df[x > 30 & y > 30, head(.SD, 2), by = .(group, trial)]
# group trial x y hour
# 1: A 1 34.35114 38.16113 13
# 2: A 1 38.49207 40.93173 15
# 3: A 2 33.42334 34.48139 11
# 4: A 2 39.71199 34.47067 12
# 5: B 1 43.06047 46.64549 2
# 6: B 1 32.54352 35.12313 15
# 7: B 2 48.03090 38.53685 5
# 8: B 2 32.11441 49.07817 18
# 9: C 1 32.73620 33.68561 1
# 10: C 1 32.00505 31.23571 20
# 11: C 2 32.13977 40.60658 9
# 12: C 2 34.13940 49.47499 16
# 13: D 1 36.18630 34.94123 19
# 14: D 1 42.80658 46.42416 23
# 15: D 2 37.05393 43.24038 3
# 16: D 2 44.32255 32.80812 8
To try a solution that is closer to what you've tried so far we can do the following
ddply(.data=df, .variables=c("group","trial"), .fun=function(df_temp) {
i <- which(df_temp$x > 30 & df_temp$y >30 )[1:2]
df_temp[i, ]
})
Some explanation
One problem with the code you provided is that you used df inside ddply. You defined .fun = function(x), but you looked for cases of x > 30 & y > 30 in df rather than in x. Further, your code uses i to index x, but i was computed from df. Finally, to my understanding there is no need for if (!is.na(i)) x[i, ]. If only one row meets your condition, you will get a row of NAs anyway, because you use which(df_temp$x > 30 & df_temp$y > 30)[1:2].
Using dplyr, you can also do:
df %>%
group_by(group, trial) %>%
slice(which(x > 30 & y > 30)[1:2])
group trial x y hour
<chr> <dbl> <dbl> <dbl> <int>
1 A 1 34.4 38.2 13
2 A 1 38.5 40.9 15
3 A 2 33.4 34.5 11
4 A 2 39.7 34.5 12
5 B 1 43.1 46.6 2
6 B 1 32.5 35.1 15
7 B 2 48.0 38.5 5
8 B 2 32.1 49.1 18
Since everything else is covered here is a base R version using split
output <- do.call(rbind, lapply(split(df, list(df$group, df$trial)),
function(new_df) new_df[with(new_df, head(which(x > 30 & y > 30), 2)), ]
))
rownames(output) <- NULL
output
# group trial x y hour
#1 A 1 34.351 38.161 13
#2 A 1 38.492 40.932 15
#3 B 1 43.060 46.645 2
#4 B 1 32.544 35.123 15
#5 C 1 32.736 33.686 1
#6 C 1 32.005 31.236 20
#7 D 1 36.186 34.941 19
#8 D 1 42.807 46.424 23
#9 A 2 33.423 34.481 11
#10 A 2 39.712 34.471 12
#11 B 2 48.031 38.537 5
#12 B 2 32.114 49.078 18
#13 C 2 32.140 40.607 9
#14 C 2 34.139 49.475 16
#15 D 2 37.054 43.240 3
#16 D 2 44.323 32.808 8
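The question also asked for rows of NA when a group has no second match, which head(..., 2) does not give. Here is a small base R sketch on made-up toy data (not the random data from the question) showing that indexing with [1:2] pads with NA. The helper first_two and the toy values are hypothetical.

```r
# Toy data: group A has two matching rows, group B has only one
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  x     = c(35, 10, 40, 5, 31, 2),
  y     = c(33, 50, 45, 40, 32, 9)
)

# Return the first two matching rows of one group, padded with NA rows
first_two <- function(d) {
  i <- which(d$x > 30 & d$y > 30)[1:2]  # [1:2] yields NA where no match exists
  out <- d[i, ]
  out$group <- d$group[1]               # keep the group label on NA rows
  out
}

res <- do.call(rbind, lapply(split(toy, toy$group), first_two))
rownames(res) <- NULL

res
```

Group B ends up with one real row followed by a row whose x and y are NA.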

Plot aggregate with multiple columns and multiple variables

Attempting to plot aggregate data from the following data.
Person Time Period Value SMA2 SMA3 SMA4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 1 14 NA NA NA
2 A 2 1 8 11 NA NA
3 A 3 1 13 10.5 11.7 NA
4 A 4 1 12 12.5 11 11.8
5 A 5 1 19 15.5 14.7 13
6 A 6 1 9 14 13.3 13.2
7 A 7 2 14 NA NA NA
8 A 8 2 7 10.5 NA NA
9 A 9 2 11 9 10.7 NA
10 A 10 2 14 12.5 10.7 11.5
# ... with 26 more rows
I have used aggregate(DataSet[,c(4,5,6,7)], by=list(DataSet$Person), na.rm = TRUE, max) to get the following:
Group.1 Value SMA2 SMA3 SMA4
1 A 20 18.0 16.66667 15.25
2 B 20 17.0 16.66667 15.00
3 C 19 18.5 14.33333 14.50
I'd like to plot the maxes for each SMA for Person A, B, and C on the same plot.
I would also like to be able to plot the mean of these maxes for each SMA column.
Any help is appreciated.
Like so? Or are you looking for something different?
df <- data.frame("Group.1"=c("A","B","C"), "Value"=c(20,20,19),
                 "SMA2"=c(18.0, 17.0, 18.5), "SMA3" =c(16.667, 16.667, 14.333),
                 "SMA4"=c(15.25, 15.00, 14.50))
library(ggplot2)
library(tidyr)
library(dplyr) # needed for group_by() and summarise()
# Long format: one row per person and column
df.g <- df %>%
  gather(SMA, Value, -Group.1)
df.g$SMA <- factor(df.g$SMA, levels=c("Value", "SMA2", "SMA3", "SMA4"))
# Mean of the maxes for each column
means <- df.g %>%
  group_by(SMA) %>%
  summarise(m=mean(Value))
ggplot(df.g, aes(x=SMA, y=Value, group=Group.1, colour=Group.1)) +
  geom_line() +
  geom_point(data=means, aes(x=SMA, y=m), inherit.aes = FALSE)
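For the second part of the question (the mean of the maxes for each column), you can check the numbers without any plotting. This is a minimal base R sketch that assumes the aggregate output exactly as printed in the question; maxes and mean_of_maxes are illustrative names.

```r
# Aggregate output as printed in the question (max per person)
maxes <- data.frame(
  Group.1 = c("A", "B", "C"),
  Value   = c(20, 20, 19),
  SMA2    = c(18.0, 17.0, 18.5),
  SMA3    = c(16.66667, 16.66667, 14.33333),
  SMA4    = c(15.25, 15.00, 14.50)
)

# Mean of the per-person maxes, one number per column
mean_of_maxes <- colMeans(maxes[-1])

round(mean_of_maxes, 2)
```

These are the values the black points in the plot should sit on.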

How to mimic ROW_NUMBER() OVER(...) in R

To manipulate/summarize data over time, I usually use SQL ROW_NUMBER() OVER(PARTITION by ...). I'm new to R, so I'm trying to recreate tables I otherwise would create in SQL. The package sqldf does not allow OVER clauses. Example table:
ID Day Person Cost
1 1 A 50
2 1 B 25
3 2 A 30
4 3 B 75
5 4 A 35
6 4 B 100
7 6 B 65
8 7 A 20
I want my final table to include the average of the previous 2 instances for each day after their 2nd instance (day 4 for both):
ID Day Person Cost Prev2
5 4 A 35 40
6 4 B 100 50
7 6 B 65 87.5
8 7 A 20 32.5
I've been trying to play around with aggregate, but I'm not really sure how to partition or qualify the function. Ideally, I'd prefer not to use the fact that id is sequential with the date to form my answer (i.e. original table could be rearranged with random date order and code would still work). Let me know if you need more details, thanks for your help!
You could lag zoo::rollapplyr with a width of 2. In dplyr,
library(dplyr)
df %>% arrange(Day) %>% # sort
group_by(Person) %>% # set grouping
mutate(Prev2 = lag(zoo::rollapplyr(Cost, width = 2, FUN = mean, fill = NA)))
#> Source: local data frame [8 x 5]
#> Groups: Person [2]
#>
#> ID Day Person Cost Prev2
#> <int> <int> <fctr> <int> <dbl>
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or all in dplyr,
df %>% arrange(Day) %>% group_by(Person) %>% mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2)
which returns the same thing. In base,
df <- df[order(df$Day), ]
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
c(NA, zoo::rollapplyr(x, width = 2, FUN = mean, fill = NA)[-length(x)])
})
df
#> ID Day Person Cost Prev2
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or without zoo,
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
(c(NA, x[-length(x)]) + c(NA, NA, x[-(length(x) - 1):-length(x)])) / 2
})
which does the same thing. If you want to remove the NA rows, tack on tidyr::drop_na(Prev2) or na.omit.
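None of the snippets above define df, so here is a self-contained base R sketch that rebuilds the example table and generalises the idea to the mean of the previous k instances per person. The helper prev_mean is hypothetical (not from any package), and it assumes Cost is numeric.

```r
# Example table from the question
df <- data.frame(
  ID     = 1:8,
  Day    = c(1, 1, 2, 3, 4, 4, 6, 7),
  Person = c("A", "B", "A", "B", "A", "B", "B", "A"),
  Cost   = c(50, 25, 30, 75, 35, 100, 65, 20)
)

# Mean of the k values that come before each position (NA until k exist)
prev_mean <- function(x, k = 2) {
  sapply(seq_along(x), function(i) {
    if (i > k) mean(x[(i - k):(i - 1)]) else NA_real_
  })
}

# Sort by Day so "previous" does not rely on ID order, then apply per person
df <- df[order(df$Day), ]
df$Prev2 <- ave(df$Cost, df$Person, FUN = prev_mean)

df
```

To get the mean of the previous three instances instead, pass FUN = function(x) prev_mean(x, k = 3).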
