How to duplicate rows by condition and replace content in R

I would like to duplicate rows of my data frame by testing a condition and then changing the contents of some variables.
My original data frame is this :
df <- data.frame(id = c("x", "y", "w"), decision = c("partial", "refusal", "total"),
code = c("AAA20", "AAA61", "AAA77"), `2nd_decision` = c("total", "partial", NA),
`2nd_code` = c("BBB50", "BBB89", NA), varx = c("a", "v", "p"))
id decision code  2nd_decision 2nd_code varx
x  partial  AAA20 total        BBB50    a
y  refusal  AAA61 partial      BBB89    v
w  total    AAA77                       p
Each time 2nd_decision is "partial" or "total", I would like to duplicate the row and replace the contents of the variables "decision" and "code" with those of "2nd_decision" and "2nd_code". In the duplicated rows I no longer want to show the content of "2nd_decision" and "2nd_code"; the rest of my data frame should stay as it was, like this:
id decision code  2nd_decision 2nd_code varx
x  partial  AAA20 total        BBB50    a
y  refusal  AAA61 partial      BBB89    v
w  total    AAA77                       p
x  total    BBB50                       a
y  partial  BBB89                       v
Thank you in advance

Is this what you want?
df <- data.frame(id = c("x", "y", "w"), decision = c("partial", "refusal", "total"),
code = c("AAA20", "AAA61", "AAA77"), `2nd_decision` = c("total", "partial", NA),
`2nd_code` = c("BBB50", "BBB89", NA), varx = c("a", "v", "p"))
add_rows <- unique(df[, c("id", "X2nd_decision", "X2nd_code", "varx")])
colnames(add_rows) <- c("id", "decision", "code", "varx")
add_rows <- add_rows[!is.na(add_rows$decision), ]
library(plyr)
df_final <- rbind.fill(df, add_rows)
df_final
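If you would rather avoid the plyr dependency, the same idea can be sketched in base R. This is only a sketch under a few assumptions: check.names = FALSE is used so the columns keep their original names, the condition is taken literally as 2nd_decision being "partial" or "total", and the names keep, extra and df_final are just illustrative.

```r
# Recreate the data, keeping the original column names (assumption: check.names = FALSE)
df <- data.frame(id = c("x", "y", "w"),
                 decision = c("partial", "refusal", "total"),
                 code = c("AAA20", "AAA61", "AAA77"),
                 `2nd_decision` = c("total", "partial", NA),
                 `2nd_code` = c("BBB50", "BBB89", NA),
                 varx = c("a", "v", "p"),
                 check.names = FALSE, stringsAsFactors = FALSE)

# Rows to duplicate: %in% returns FALSE for NA, so no extra NA handling is needed
keep <- df$`2nd_decision` %in% c("partial", "total")
extra <- df[keep, ]

# Move the second decision/code into the main columns and blank the 2nd_* columns
extra$decision <- extra$`2nd_decision`
extra$code <- extra$`2nd_code`
extra$`2nd_decision` <- NA_character_
extra$`2nd_code` <- NA_character_

# Append the duplicated rows below the untouched original rows
df_final <- rbind(df, extra)
rownames(df_final) <- NULL
df_final
```

The original rows are left as they were; only the appended copies have the 2nd_* columns blanked.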

You can use mutate in combination with an ifelse statement.
Let's recreate your data first.
df <- data.frame(id = c("x", "y", "w", "x", "y"),
decision = c("partial", "refusal", "total", "total", "partial"),
code = c("AAA20", "AAA61", "AAA77", "BBB50", "BBB89"),
decision2 = c("total", "partial", NA, NA, NA),
varx = c("a", "v", "p", "a", "v"))
And here is the code to test the second decision and remove the unwanted variable.
library(tidyverse)
dfnew <- df %>%
mutate(code = ifelse(decision2 == "total", "BBB50",
ifelse(decision2 == "partial", "BBB89", NA))) %>%
select(-decision2)

Related

Merging two dataframes based on conditions in multiple columns

I am trying to create a new df, call it df3, out of two other datasets:
df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))
and:
df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))
The conditions for the observations that should be matched, and form a new dataset, are:
df1$String %in% df2$String
grepl(df1$Title, df2$Title) == TRUE
df1$Date < df2$Date
What is the best way to do this kind of merging? I have tried to create an indicator along the lines of:
df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title) & df1$Date < df2$Date, 1, 0)
or
df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title[df1$String %in% df2$String]) & df1$Date < df2$Date[df1$String %in% df2$String], 1, 0)
to then use for merging, but I've been getting "longer object length is not a multiple of shorter object length" and "argument 'pattern' has length > 1 and only the first element will be used" warnings.
One way: use a cross join, then filter the result.
Note that grepl is not vectorized over both arguments, so I use mapply.
df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))
df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))
merge(df1,df2, by=NULL, suffixes = c(".x", ".y")) |>
subset(String.x %in% String.y
& mapply(grepl, Title.x, Title.y)
& Date.x < Date.y )
#> String.x Title.x Date.x String.y Title.y Date.y
#> 1 a A 2020-01-01 a ABCDEF 2020-01-03
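To see the mapply step in isolation, here is a small sketch of pairing each pattern with its own string. The vectors patterns and texts are made up for illustration, and fixed = TRUE is my own assumption (it treats the titles as literal text rather than regular expressions).

```r
# Hypothetical inputs: one pattern per string, compared position by position
patterns <- c("A", "B", "YZ")
texts <- c("ABCDEF", "XYZ", "YZ")

# mapply calls grepl(patterns[i], texts[i], fixed = TRUE) for each i,
# which is the pairwise comparison that a single grepl() call cannot do
pairwise <- mapply(grepl, patterns, texts,
                   MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
pairwise
```

Called directly with a vector of patterns, grepl only uses the first one, which is where the "argument 'pattern' has length > 1" warning in the question comes from.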

Calculating a rolling return

I have a data frame with 3 columns. What I want to do is calculate the compounded return (the product of the returns) over a rolling window of a selected number of months, for each monthly period (in other words, each row) where enough data is available. This is the basic structure of the data.
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$return <- runif(774, min=0, max=1)
What it should end with is a dataframe where a new column is added with the selected period cumulative return for that time frame. For example, I have shown below a four month return. The calculation of the 4-month return on 03/30/2012 from the data would be:
((1+0.81/100)*(1+0.715/100)*(1+0.27/100)*(1+0.80/100)-1)*100
This would be repeated for each value under the X column.
I ended up using the mutate function, where you can set the window width. This is the version I ended up with:
library(dplyr)
library(zoo)
# Create Test Dataframe
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$performance <- runif(774, min=0, max=1)
This particular code creates a 5-month average on a rolling basis. If you sort by column x you can see and recreate it in Excel.
df <- df %>%
group_by(x) %>%
mutate(x_mean = rollmean(performance, 5, fill = NA, align = 'right'))
I also found a way to create a lag so I could take the 4 prior values to the observation and calculate the mean:
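df2 <- df %>%
  # lag by one row so the window only covers values before the observation
  mutate(perf.1.previous = lag(performance, n = 1),
         perf.4.previous = rollapply(data = perf.1.previous, width = 4, FUN = mean,
                                     align = "right", fill = NA, na.rm = TRUE))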
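The code above computes rolling means. For the compounded return described in the question, a minimal base R sketch could look like this. It assumes the returns are expressed in percent, as in the worked formula in the question; rolling_compound is a hypothetical helper name, not a function from dplyr or zoo.

```r
# Hypothetical helper: compounded return (in percent) over a trailing window.
# Positions without a full window of prior values are left as NA.
rolling_compound <- function(r_pct, width) {
  n <- length(r_pct)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in width:n) {
      window <- r_pct[(i - width + 1):i]
      out[i] <- (prod(1 + window / 100) - 1) * 100
    }
  }
  out
}

# The four monthly returns from the question's formula, plus one made-up month
returns <- c(0.81, 0.715, 0.27, 0.80, 0.5)
four_month <- rolling_compound(returns, width = 4)
four_month
```

To apply it per asset, something like ave(df$return, df$x, FUN = function(r) rolling_compound(r, 4)) should work, assuming the rows are already sorted by date within each asset.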

R merging Partial match

There are a lot of answers about this, but I could not find one that covers the problem I am dealing with.
I have 2 dataframes:
df1:
df2:
setA <- read.table("df1.txt",sep="\t", header=TRUE)
setB <- read.table("df2.txt",sep="\t", header=TRUE)
So, I want the matching rows by column value:
library(data.table)
setC <-merge(setA, setB, by.x = "name", by.y = "name", all.x = FALSE)
And I get this output:
df3:
This happens because df1 also contains the value 1, but separated from another value by a ";". How can I get the desired output?
Thanks!!
In future please apply the function dput(df1) and dput(df2) and copy and paste the output from the console into your question.
Base R solution to two part question:
# First unstack the 1;7 row into two separate rows:
name_split <- strsplit(df1$name, ";")
# If the values of last vector uniquely identify each row in the dataframe:
df_ro <- data.frame(name = unlist(name_split),
last = rep(df1$last, sapply(name_split, length)),
stringsAsFactors = FALSE)
# Left join to achieve the same result as first solution
# without specifically naming each vector:
df1_ro <- merge(df1[,names(df1) != "name"], df_ro, by = "last", all.x = TRUE)
# Then perform an inner join preventing a name space collision:
df3 <- merge(df1_ro, setNames(df2, paste0(names(df2), ".x")),
by.x = "name", by.y = "name.x")
# If you wanted to perform an inner join on all intersecting columns (returning
# no results because values in last and colour are different then):
df3 <- merge(df1_ro, df2, by = intersect(names(df1_ro), names(df2)))
Data:
df1 <- data.frame(name = c("1;7", "3", "4", "5"),
last = c("p", "q", "r", "s"),
colour = c("a", "s", "d", "f"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("1", "2", "3", "4"),
last = c("a", "b", "c", "d"),
colour = c("p", "q", "r", "s"), stringsAsFactors = FALSE)
In the end I solved it with this Python script:
co = open('NewFile.txt', 'w')
f = open('IndexFile.txt', 'r')
g = open('File.txt', 'r')
tabla1 = f.readlines()
tabla2 = g.readlines()
B = []
for ln in tabla1:
    # key to look for: the 4th tab-separated field of the index line
    B = ln.split('\t')[3]
    for k, ln2 in enumerate(tabla2):
        # partial match: the key only has to be contained in the 4th field
        if B in ln2.split('\t')[3]:
            xx = ln2
            print(xx)
            co.write(xx)
            break
co.close()

R multiple choice questionnaire data to ggplot

I have a Qualtrics multiple choice question that I want to use to create graphs in R. My data is organized so that participants can select multiple answers for each question. For example, participant 1 selected multiple choice answers 1 (Q1_1) & 3 (Q1_3). I want to collapse all answer choices into one bar graph, with one bar for each response option (Q1_1:Q1_3), where each count is divided by the number of respondents who answered this question (in this case, 3).
df <- structure(list(Participant = 1:3, A = c("a", "a", ""), B = c("", "b", "b"), C = c("c", "c", "c")), .Names = c("Participant", "Q1_1", "Q1_2", "Q1_3"), row.names = c(NA, -3L), class = "data.frame")
I want to use ggplot2 and maybe some sort of loop through Q1_1: Q1_3?
Perhaps this is what you want
f <-
structure(
list(
Participant = 1:3,
A = c("a", "a", ""),
B = c("", "b", "b"),
C = c("c", "c", "c")),
.Names = c("Participant", "Q1_1", "Q1_2", "Q1_3"),
row.names = c(NA, -3L),
class = "data.frame"
)
library(tidyr)
library(dplyr)
library(ggplot2)
nparticipant <- nrow(f)
f %>%
## Reformat the data
gather(question, response, starts_with("Q")) %>%
filter(response != "") %>%
## calculate the height of the bars
group_by(question) %>%
summarise(score = length(response)/nparticipant) %>%
## Plot
ggplot(aes(x=question, y=score)) +
geom_bar(stat = "identity")
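As a quick cross-check of the bar heights, the same proportions can be sketched in base R without reshaping. This assumes, as in the question, that an empty string means the option was not selected; score is just an illustrative name.

```r
# Same toy data as in the question
f <- data.frame(Participant = 1:3,
                Q1_1 = c("a", "a", ""),
                Q1_2 = c("", "b", "b"),
                Q1_3 = c("c", "c", "c"),
                stringsAsFactors = FALSE)

# TRUE where an option was selected; the column mean of that logical matrix
# is the share of participants who picked each option
score <- colMeans(f[, -1] != "")
score
```

The result is a named vector with one value per question column, so it can be passed to barplot() directly if ggplot2 is not needed.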
Here is a solution using ddply from the plyr package.
library(plyr)
library(ggplot2)
# I needed to increase the number of participants to ensure it works in every case
df <- data.frame(Participant = 1:100,
                 Q1_1 = sample(c("a", ""), 100, replace = TRUE, prob = c(1/2, 1/2)),
                 Q1_2 = sample(c("b", ""), 100, replace = TRUE, prob = c(2/3, 1/3)),
                 Q1_3 = sample(c("c", ""), 100, replace = TRUE, prob = c(1/3, 2/3)))
# Combine the selected options into one answer string per participant
df$answer <- paste0(df$Q1_1, df$Q1_2, df$Q1_3)
# Share of participants giving each answer combination
summ <- ddply(df, c("answer"), summarize, freq = length(answer)/nrow(df))
## Re-ordering of factor levels summ$answer
summ$answer <- factor(summ$answer, levels = c("", "a", "b", "c", "ab", "ac", "bc", "abc"))
# Plot
ggplot(summ, aes(answer, freq, fill = answer)) + geom_bar(stat = "identity") + theme_bw()
Note : it might be more complicated if you have more columns relating to other questions ("Q2_1", "Q2_2"...). In this case, melting data for each question could be a solution.
I think you want something like this (proportion with a stacked bar chart):
Participant Q1_1 Q1_2 Q1_3
1 1 a c
2 2 a a c
3 3 c b c
4 4 b d
library(reshape2)  # assumption: melt() is taken from reshape2
library(ggplot2)
# ensure that all question columns have the same factor levels, ignore blanks
for (i in 2:4) {
  df[, i] <- factor(df[, i], levels = letters[1:4])
}
# proportion of each choice within each question
tdf <- as.data.frame(sapply(df[2:4], function(x) table(x) / sum(table(x))))
tdf$choice <- rownames(tdf)
tdf <- melt(tdf, id = 'choice')
ggplot(tdf, aes(variable, value, fill = choice)) +
  geom_bar(stat = 'identity') +
  xlab('Questions') +
  ylab('Proportion of Choice')

How to concatenate multiple columns with a comma between them

I have the following data frame in r
ID COL.1 COL.2 COL.3 COL.4
1  a     b
2  v     b     b
3  x     a     n     h
4  t
I am new to R and I don't understand how to process the data frame in order to get this at the end:
stream <- c("1,a,b","2,v,b,b","3,x,a,n,h","4,t")
Another problem is that I have more than 100 columns.
Try this
Reduce(function(...)paste(...,sep=","), df)
Where df is your data.frame
This might be what you're looking for, even though it's not elegant.
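# Use NA for the empty cells: NULL is dropped by c(), which would leave
# columns of different lengths and make data.frame() fail
my_df <- data.frame(ID = seq(1, 4, by = 1),
                    COL.1 = c("a", "v", "x", "t"),
                    COL.2 = c("b", "b", "a", NA),
                    COL.3 = c(NA, "b", "n", NA),
                    COL.4 = c(NA, NA, "h", NA))
# substring(x, 3) drops the first two characters (the single-digit ID and its comma)
stream <- substring(paste(my_df$ID,
                          my_df$COL.1,
                          my_df$COL.2,
                          my_df$COL.3,
                          my_df$COL.4,
                          sep = ","), 3)
# paste() turns missing values into the string "NA"; strip those placeholders
stream <- gsub(",NA", "", stream)
stream <- gsub("NA,", "", stream)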
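With more than 100 columns, naming each one by hand gets tedious. Here is a hedged sketch that loops over the rows instead and keeps the ID, as in the stream shown in the question. It assumes empty cells are stored as NA or as empty strings; the toy data frame is made up to mirror the table in the question.

```r
# Toy data mirroring the question; empty cells are assumed to be NA
my_df <- data.frame(ID = 1:4,
                    COL.1 = c("a", "v", "x", "t"),
                    COL.2 = c("b", "b", "a", NA),
                    COL.3 = c(NA, "b", "n", NA),
                    COL.4 = c(NA, NA, "h", NA),
                    stringsAsFactors = FALSE)

# For each row: convert every cell to character, drop missing or empty cells,
# then join what is left with commas. Works for any number of columns.
stream <- vapply(seq_len(nrow(my_df)), function(i) {
  row <- unlist(lapply(my_df[i, ], as.character))
  paste(row[!is.na(row) & nzchar(row)], collapse = ",")
}, character(1))
stream
```

Because missing cells are dropped before pasting, there is no "NA" text to clean up afterwards with gsub.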

Resources