filling rows with former numbers R - r

I have a data frame that is the result of a merge (all = TRUE) and looks like the one below. The merge is done by GroupName, ObservationName and Date, and the two Treatment columns come from x:
A <- data.frame(GroupName = c(rep(c("A", "B", "C"), each = 6)),
ObservationName = c("alpha", "beta", "gamma", "alpha", "beta", "gamma", rep(c("delta", "epsilon"),3), rep(c("zeta", "eta", "theta"),2)),
Date = rep(rep(seq(as.Date("2010-1-1"), as.Date("2010-3-1"), by = "month"), each =3), 2),
Value = runif(n = 18, min = 1, max = 10),
Treatment1 = rep(NA, 18),
Treatment2 = rep(NA, 18))
A[c(1, 5, 6, 10, 12,13),5] <- 1
A[c(1, 5, 6, 10, 12,13),6] <- c(1, 3, 5, 7, 3, 4)
A[c( 7, 10 , 14), c(1,2,4)] <- NA
I would like to carry the values of Treatment1 and Treatment2 forward. That is, I want to group the data frame by GroupName and ObservationName and order it by Date. If Treatment1 has a 1 in an earlier observation of a group, all later observations in that group should have a 1 as well. In Treatment2 the numbers should accumulate. That means: rows 1, 2, 3 and 4 should contain 1, row 5 should contain 4 (since 1 + 3), row 6 should contain 9 (since 1 + 3 + 5), and so on. Thanks for your help.
One of my tries with dplyr is:
A %>% group_by(GroupName, ObservationName) %>%
arrange(Date) %>%
mutate(Treatment1 = sum(Treatment1),
Treatment1cm = cummax(Treatment1)) %>%
ungroup()
but that does not override the NAs.
The aim is to then delete all rows where only Treatment1 and Treatment2 are given (i.e. where Value is NA), once all of their information has been carried over.
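A minimal base R sketch of the carry-forward idea described above, not a definitive answer. It uses a hypothetical toy data frame (`toy`) with a single group, and it assumes that the grouping columns are filled in on every row and that a missing treatment can be read as 0. Under those assumptions, `cummax` keeps the Treatment1 flag set once it appears and `cumsum` accumulates Treatment2:

```r
# Hypothetical toy data: one group, treatments recorded on some dates only
toy <- data.frame(
  GroupName       = "A",
  ObservationName = "alpha",
  Date            = as.Date(c("2010-03-01", "2010-01-01", "2010-02-01", "2010-04-01")),
  Value           = c(3.2, NA, 1.5, 4.8),
  Treatment1      = c(NA, 1, NA, 1),
  Treatment2      = c(NA, 1, NA, 3)
)

# Order by group and date so the cumulative functions run forward in time
toy <- toy[order(toy$GroupName, toy$ObservationName, toy$Date), ]

# Assumption: a missing treatment means "nothing happened", i.e. 0
t1 <- ifelse(is.na(toy$Treatment1), 0, toy$Treatment1)
t2 <- ifelse(is.na(toy$Treatment2), 0, toy$Treatment2)

# Within each group: cummax keeps the flag at 1 once set, cumsum adds up the amounts
toy$Treatment1 <- ave(t1, toy$GroupName, toy$ObservationName, FUN = cummax)
toy$Treatment2 <- ave(t2, toy$GroupName, toy$ObservationName, FUN = cumsum)

# Drop the treatment-only rows now that their information is carried forward
result <- toy[!is.na(toy$Value), ]
print(result)
```

The same steps map onto dplyr as `group_by()`, `arrange()`, `mutate()` with `cummax()`/`cumsum()`, and a final `filter(!is.na(Value))`.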

Related

How do I create two new variables out of one variable, and attach dummy values to it in R?

I am completely new to any kind of coding, never mind R in particular, so my days of googling have not been very helpful. I would really appreciate any kind of help/insights!
I would like to know how to get two new variables out of the original variable, and attach new values to it - basically I start with this:
and want to obtain this:
I managed to get it in long format with melt(dataname, id.vars=c("ID")) and the ID & value I get are good. But there is only one variable with my four headers (loudHot, quietHot, loudCold, quietCold) repeated - how do I create two new variables out of this and assign the values to it (e.g. that "Volume" has the value 1 when the original variable is loudHot or loudCold and 0 if it's quietHot or quietCold, and then "Temp" is 1 when the original variable is loudHot or quietHot and 0 when it's loudCold or quietCold)?
I wouldn't be too hard on yourself - this isn't really trivial. Anyway, you can use pivot_longer from tidyr and some data manipulation with dplyr to achieve your desired outcome:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-ID) %>%
mutate(Volume = as.numeric(grepl("loud", name)),
Temp = as.numeric(grepl("Hot", name))) %>%
select(ID, Volume, Temp, value)
#> # A tibble: 32 x 4
#> ID Volume Temp value
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2 1 1 14
#> 2 2 0 1 16
#> 3 2 1 0 16
#> 4 2 0 0 15
#> 5 4 1 1 19
#> 6 4 0 1 15
#> 7 4 1 0 10
#> 8 4 0 0 8
#> 9 6 1 1 11
#> 10 6 0 1 17
#> # ... with 22 more rows
Data
df <- data.frame(ID = (1:8) * 2,
loudHot = c(14, 19, 11, 20, 18, 17, 16, 2),
quietHot = c(16, 15, 17, 5, 10, 10, 15, 0),
loudCold = c(16, 10, 10, 4, 3, 2, 14, 2),
quietCold = c(15, 8, 17, 8 ,10, 12, 5, 0))
As a tip for any future SO questions, please don't post images of data. Folks here need to be able to cut and paste the text of your data to test and verify solutions. Ideally, you should do this by pasting the output of the dput function into a code block. People rarely go to the effort of manually transcribing data from your images.
Created on 2022-02-04 by the reprex package (v2.0.1)
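To illustrate the dput tip, here is a small sketch with a hypothetical two-row data frame (`mydata` is a stand-in, not the asker's data). `dput` prints plain text that others can paste into R to rebuild the object:

```r
# Hypothetical small data frame standing in for your real data
mydata <- data.frame(ID = c(2, 4), loudHot = c(14, 19))

# Print a text representation that can be pasted into a question
dput(mydata)

# Round trip: evaluating the deparsed text rebuilds the data frame
rebuilt <- eval(parse(text = paste(deparse(mydata), collapse = "")))
print(all.equal(rebuilt, mydata))
```

For larger data, dput(head(mydata, 10)) keeps the pasted text short.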
Let's approach your problem using the dplyr and tidyr packages.
The first recommendation is to always include a minimal reproducible example of your data so that we can use it and help you faster. This is not complicated: you can use dput(head(yourdata, 10)), for example, or simulate some observations.
I simulated some data as follows:
library(dplyr)
library(tidyr)
data <- data.frame(
id = 1:5,
loudHot = sample(10:20, 5, replace = TRUE),
quietHot = sample(10:20, 5, replace = TRUE),
loudCold = sample(0:12, 5, replace = TRUE),
quietCold = sample(0:12, 5, replace = TRUE)
)
Now that we have the data, let's turn it into long format using tidyr::pivot_longer. This function takes as arguments the data frame in wide format and the columns you want to gather (or those you do not want to gather, marked with the - symbol).
# Data to long format
data_long <- pivot_longer(
data, cols = -id,
names_to = 'variable', values_to = 'value'
)
With that, you only have to create the dummies, which is simple.
# Adding new variables
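data_with_dummy <- mutate(
data_long,
# volume is 1 for the two "loud" columns, 0 otherwise
volume = as.numeric(variable %in% c('loudHot', "loudCold")),
# temp is 1 for the two "Hot" columns, 0 otherwise
temp = as.numeric(variable %in% c('loudHot', "quietHot"))
)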
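As a quick check of the dummy coding, the sketch below builds the mapping on the four original column names alone (the `variable` vector is typed in by hand for illustration) and compares the %in% approach with the grepl approach from the other answer:

```r
# The four original column names, as they appear after reshaping to long format
variable <- c("loudHot", "quietHot", "loudCold", "quietCold")

# Dummy coding by explicit membership
volume_in <- as.numeric(variable %in% c("loudHot", "loudCold"))
temp_in   <- as.numeric(variable %in% c("loudHot", "quietHot"))

# Dummy coding by pattern matching on the name
volume_grep <- as.numeric(grepl("loud", variable))
temp_grep   <- as.numeric(grepl("Hot", variable))

# Both approaches should give the same mapping
mapping <- data.frame(variable, volume = volume_in, temp = temp_in)
print(mapping)
print(identical(volume_in, volume_grep) && identical(temp_in, temp_grep))
```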
Here's a base R approach:
# Original data
df <- data.frame(
ID = c(2, 4, 5, 7, 8, 11, 12, 16),
loudHot = c(14, 19, 11, 20, 18, 17, 16, 2),
quietHot = c(16, 15, 17, 5, 10, 10, 15, 0),
loudCold = c(16, 10, 10, 4, 3, 2, 14, 2),
quietCold = c(15, 8, 17, 8, 10, 12, 5, 0)
)
# Stacked data
df_stacked <- stack(
df,
select = c(
"loudHot", "quietHot", "loudCold", "quietCold"
)
)
# New variable for volume
df_stacked$Volume <- as.numeric(grepl("loud", df_stacked$ind))
# New variable for Temp
df_stacked$Temp <- as.numeric(grepl("Hot", df_stacked$ind))
# Replace "ind" values with "ID"
df_stacked$ind <- rep(df$ID, times = 4)
# Reorder columns
new_df <- df_stacked[,c(2:4,1)]
# Rename columns
colnames(new_df) <- c("ID", "Volume", "Temp", "Value")
# Order by ID
new_df[order(new_df$ID),]
I believe your columns for "Volume" and "Temp" should be alternating sequences:

How to find a minimum of two columns based on a condition?

I have a dataframe in R looking like that
ID1 <- c(1,2,3,4,5,6,7,8,9)
Value1 <- c(2,3,5,2,5,8,17,3,5)
ID2 <- c(1,2,3,4,5,6,7,8,9)
Value2 <- c(4,6,3,5,8,1,2,8,10)
df <- as.data.frame(cbind(ID1,Value1,ID2,Value2))
Now I am searching for the minimum of Value1 + Value2 over all combinations for which ID1 + ID2 is at most 9. The two values do not need to come from the same row, but the sum ID1 + ID2 must not exceed 9.
The result should point me to the combination of x in Value1 and y in Value2 that together give the lowest possible sum under the condition that ID1 + ID2 <= 9.
Thanks in advance!
One possibility
library(dplyr)
goodrow <- filter(df, ID1 + ID2 <= 9) %>% mutate(sumval = Value1 + Value2) %>% filter(sumval == min(sumval))
If I understand your question correctly, consider using the crossing function from tidyr. This computes all combinations of the rows of the two data frames, and therefore of ID1 and ID2:
library(dplyr)
library(tidyr)
df <- as.data.frame(cbind(ID1,Value1))
df2 <- as.data.frame(cbind(ID2,Value2))
df_test <- crossing(df, df2)
goodrow <- filter(df_test, ID1 + ID2 <= 9) %>% mutate(sumval = Value1 + Value2) %>% filter(sumval == min(sumval))
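The same all-combinations idea can be sketched in base R, where merge with by = NULL returns the Cartesian product of two data frames. The names `left`, `right`, `combos` and `best` are only for illustration:

```r
ID1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
Value1 <- c(2, 3, 5, 2, 5, 8, 17, 3, 5)
ID2 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
Value2 <- c(4, 6, 3, 5, 8, 1, 2, 8, 10)

left  <- data.frame(ID1, Value1)
right <- data.frame(ID2, Value2)

# by = NULL makes merge() return every pairing of a left row with a right row
combos <- merge(left, right, by = NULL)

# Keep only pairings whose IDs sum to at most 9, then find the smallest value sum
combos <- combos[combos$ID1 + combos$ID2 <= 9, ]
combos$sumval <- combos$Value1 + combos$Value2
best <- combos[combos$sumval == min(combos$sumval), ]
print(best)
```

Filtering on the ID condition before taking the minimum matters: the minimum is then computed only over admissible pairings.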
For your specific case
which.min(rowSums(df[rowSums(df[,c("ID1","ID2")])<10,c("Value1","Value2")]))
You can use a SQL query to answer the question with the sqldf package
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
df <- structure(list(ID1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9), Value1 = c(2,
3, 5, 2, 5, 8, 17, 3, 5), ID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
Value2 = c(4, 6, 3, 5, 8, 1, 2, 8, 10)), class = "data.frame", row.names = c(NA,
-9L))
# just get min sum
sqldf('
select
min(a.Value1 + b.Value2) as min_sum
from
df a
join df b
on a.ID1 + b.ID2 <= 9
')
#> min_sum
#> 1 3
# show the rows where min sum occurs
sqldf('
select
a.Value1
, b.Value2
, a.ID1
, b.ID2
from
df a
join df b
on a.ID1 + b.ID2 <= 9
group by
1 = 1
having
a.Value1 + b.Value2 = min(a.Value1 + b.Value2)
')
#> Value1 Value2 ID1 ID2
#> 1 2 1 1 6
Created on 2021-11-15 by the reprex package (v2.0.1)
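Another one-liner, using filter from dplyr. Note that min(new) is taken over all rows, so this only returns a row when the overall minimum also satisfies the ID condition: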
filter(transform(df, 'new' = df$Value1 + df$Value2),(df$ID1 + df$ID2 <=9)&(new == min(new)))

R: Count how many times a value has occurred before within a certain range of rows

I have a dataframe like this:
df <- data.frame("subj.no" = rep(1:3, each = 24),
"trial.no" = rep(1:3, each = 8, length.out = 72),
"item" = c(rep(c("ball", "book"), 4), rep(c("doll", "rope"), 4), rep(c("fish", "box"), 4), rep(c("paper", "candle"), 4), rep(c("horse", "marble"), 4), rep(c("doll", "rope"), 4), rep(c("tree", "dog"), 4), rep(c("ball", "book"), 4), rep(c("horse", "marble"), 4)),
"rep.no" = rep(1:4, each = 2, length.out = 72),
"DV" = c(1,0,1,0,1,0,0,1,1,0,1,0,0,0,1,0,1,0,1,0,1,0,0,0,0,1,1,1,1,0,0,1,0,1,1,0,0,1,0,1,1,1,0,1,0,0,
1,0,0,1,1,0,1,0,0,1,1,1,1,0,0,0,0,0,0,1,0,1,0,1,1,0))
I now want to create another column DV.no which says that the value 1 occurred the nth time within that combination of subj.no, trial.no and item. For DV==0, the value in the new column should be 0.
So the resulting vector should look like this:
DV.no = c(1,0,2,0,3,0,0,1,1,0,2,0,0,0,3,0,1,0,2,0,3,0,0,0,0,1,1,2,2,0,0,3,0,1,1,0,0,2,0,3,1,1,0,2,0,0,2,0,0,1,1,0,2,0,0,2,1,1,2,0,0,0,0,0,0,1,0,2,0,3,1,0)
So basically, for each unique combination of values in subj.no, trial.no and item, whenever the value of DV is 1, then 1 should be added to the count in the new variable.
(Remark: The column rep.no is not part of the relevant value combination. But it's in the df anyway, and since I didn't know if it's useful for the solution, I left it there.)
How can this be done in R?
We can do a group by cumsum on the 'DV' column
library(dplyr)
df %>%
group_by(subj.no, trial.no, item) %>%
mutate(DV.no = cumsum(DV) * DV)
Or in base R with ave
df$DV.no <- with(df, DV * ave(DV, subj.no, trial.no, item, FUN = cumsum))
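To see why multiplying the running sum by DV works, here is a small sketch on a hypothetical six-element vector with a made-up grouping variable `grp`. The cumulative sum counts the 1s seen so far, and multiplying by DV resets the positions where DV is 0:

```r
# Hypothetical toy input: two groups of three observations each
DV  <- c(1, 0, 1, 1, 0, 1)
grp <- c("a", "a", "a", "b", "b", "b")

# Ungrouped: running count of 1s, zeroed wherever DV itself is 0
running <- cumsum(DV)
ungrouped <- running * DV

# Grouped: ave() restarts the running count within each group
grouped <- ave(DV, grp, FUN = cumsum) * DV

print(ungrouped)
print(grouped)
```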

Lag in multiindex time series panel data on R

I have a dataset with hundreds of rows structured like this
User Date Value1 Value2
A 2012-01-01 4 3
A 2012-01-02 5 7
A 2012-01-03 6 1
A 2012-01-04 7 4
B 2012-01-01 2 4
B 2012-01-02 3 2
B 2012-01-03 4 9
B 2012-01-04 5 3
As the panel data has two indices (User = k, Date = t), I struggle to run a regression in R where the dependent variable (Value1) is lagged only on the time index. The regression should be performed as follows:
Value1(k,t+1) ~ Value2(k,t)
or
Value1(k,t) ~ Value2(k,t-1)
Any suggestions?
For every user, you can do:
> df <- data.frame(User = c(rep("A", 4), rep("B", 4)),
+ Date = rep(seq.Date(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
+ Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
+ Value2 = c(3, 7, 1, 4, 4, 2, 9, 3))
>
> df_A <- df[df$User == "A", c("Value1", "Value2")]
> ts_A <- ts(df_A, start = c(2012, 1, 1), frequency = 365)
> ts_A <- ts.intersect(ts_A, lag(ts_A, -1))
> colnames(ts_A) <- c("Value1", "Value2", "Value1_t_1", "Value2_t_1")
>
> lm(Value1 ~ Value2_t_1, ts_A)
Call:
lm(formula = Value1 ~ Value2_t_1, data = ts_A)
Coefficients:
(Intercept) Value2_t_1
6.3929 -0.1071
>
Hope it helps.
Here's a solution using the dplyr package. In the code below I explicitly reference the lag function from dplyr rather than the one from base R (stats), because dplyr's lag does not require a time series input.
I would also note that the two formulas you list are indexed over different rows of the data, although each pairs a Value1 with the previous day's Value2 for the same user, so on this data both fits give the same coefficients:
Value1(k,t+1) ~ Value2(k,t) : run on the rows dated 2012-01-01 to 2012-01-03
Value1(k,t) ~ Value2(k,t-1) : run on the rows dated 2012-01-02 to 2012-01-04
library("tidyverse")
df <- data.frame(User = c(rep("A", 4), rep("B", 4)),
Date = rep(seq.Date(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
Value2 = c(3, 7, 1, 4, 4, 2, 9, 3))
df2 <- df %>% arrange(User,Date) %>%
group_by(User) %>%
mutate(lag_v2 = dplyr::lag(Value2),
lead_v1 = dplyr::lead(Value1))
df3<-df2[!is.na(df2$lag_v2),]
df4<-df2[!is.na(df2$lead_v1),]
summary(lm(Value1~lag_v2,data=df3))
summary(lm(lead_v1~Value2,data=df4))
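For comparison, a base R sketch of the same per-user lag and lead, with no extra packages. The helpers `shift_back` and `shift_forward` are hypothetical names introduced here. Because both formulations pair each Value1 with the previous day's Value2 of the same user, the two fits should coincide on this data:

```r
df <- data.frame(
  User   = rep(c("A", "B"), each = 4),
  Date   = rep(seq(as.Date("2012-01-01"), by = "day", length.out = 4), 2),
  Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
  Value2 = c(3, 7, 1, 4, 4, 2, 9, 3)
)

# Sort so that shifting by one position means shifting by one day within a user
df <- df[order(df$User, df$Date), ]

# Hypothetical helpers: take the value from t-1 or from t+1
shift_back    <- function(x) c(NA, head(x, -1))
shift_forward <- function(x) c(tail(x, -1), NA)

# ave() applies the shift separately for each user
df$lag_v2  <- ave(df$Value2, df$User, FUN = shift_back)
df$lead_v1 <- ave(df$Value1, df$User, FUN = shift_forward)

# lm() drops rows with NA by default, so no manual filtering is needed
fit_lag  <- lm(Value1 ~ lag_v2, data = df)
fit_lead <- lm(lead_v1 ~ Value2, data = df)

print(unname(coef(fit_lag)))
print(unname(coef(fit_lead)))
```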

Find unique max values in data.frame by row also when two or more

I have several data.frames and I have to find the max value given a certain column. Some data.frames have a unique max value but others have two or more unique max values.
How can I print the rows with max values of such data.frames?
Some fake data:
#### Simple case with only one unique max value
df = data.frame(x = c(1,1,1,1,2,2,2,2,2), y = c(10, 10, 10, 10, 10, 10, 10, 9, 9))
df = data.frame(table(df$y))
df$Var1 = as.numeric(levels(df$Var1))[df$Var1]
max_val = df[which.max(df$Freq),]
print(max_val)
Var1 Freq
2 10 7
#### Unknown case with two unique max values
df_2 = data.frame(x = c(1,1,1,1,2,2,2,2,2), y = c(10, 10, 10, 9, 9, 9, 11, 11, 15))
df_2 = data.frame(table(df_2$y))
df_2$Var1 = as.numeric(levels(df_2$Var1))[df_2$Var1]
Desired output from df_2
Var1 Freq
1 9 3
2 10 3
Thanks for any help
Select the rows where Freq equals the maximum:
df_2[df_2$Freq == max(df_2$Freq),]
# Var1 Freq
#1 9 3
#2 10 3
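The reason which.max is not enough here is that it returns only the position of the first maximum. A short sketch, with the frequency table typed in by hand (`freq` is a stand-in for df_2 after the table step):

```r
# Frequency table equivalent to df_2 after table(): values 9 and 10 tie at 3
freq <- data.frame(Var1 = c(9, 10, 11, 15), Freq = c(3, 3, 2, 1))

# which.max() stops at the first maximum
first_only <- which.max(freq$Freq)

# Comparing against max() finds every row that ties for the maximum
all_max <- which(freq$Freq == max(freq$Freq))

print(first_only)
print(freq[all_max, ])
```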
