I have a data frame where some columns have the same data, but different column names. I would like to remove duplicated columns, but merge the column names. An example, where test1 and test4 columns are duplicates:
df
test1 test2 test3 test4
1 1 1 0 1
2 2 2 2 2
3 3 4 4 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
and I would like the result to be something like this:
df
test1+test4 test2 test3
1 1 1 0
2 2 2 2
3 3 4 4
4 4 4 4
5 5 5 5
6 6 6 6
Here is the data:
structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4,
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4,
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA,
-6L), class = "data.frame")
Please note that I do not simply want to remove duplicated columns. I also want to merge the column names of the duplicated columns, after the duplicates are removed.
I could do it manually for the simple table I posted, but I want to use this on large datasets, where I don't know in advance which columns are identical. I do not want to remove and rename columns manually, since I might have over 50 duplicated columns.
OK, improving on the above answer using the idea from here. Save the duplicate and non-duplicate columns into separate data frames. Then check whether each non-duplicate column matches any duplicate column, and if so concatenate their column names. This also works when a column is duplicated more than once.
Edited: Changed summary to digest. This helps with character data.
df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4,
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4,
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA,
-6L), class = "data.frame")
library(digest)
nondups <- df[!duplicated(lapply(df, digest))]
dups <- df[duplicated(lapply(df, digest))]
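# seq_len() keeps the loops from running when there are no duplicate columns
for(i in seq_len(ncol(nondups))){
  for(j in seq_len(ncol(dups))){
    # identical() is also safe when the columns contain NA values
    if(identical(nondups[[i]], dups[[j]])) names(nondups)[i] <- paste(names(nondups)[i], names(dups)[j], sep = "+")
  }
}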
nondups
Example 2, as a function.
Edited: Changed summary to digest, and the function now returns both the non-duplicated and the duplicated data frames.
age <- 18:29
height <- c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender <- c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe <- data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender, gender3 = gender)
dupcols <- function(df = testframe){
nondups <- df[!duplicated(lapply(df, digest))]
dups <- df[duplicated(lapply(df, digest))]
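# seq_len() keeps the loops from running when there are no duplicate columns
for(i in seq_len(ncol(nondups))){
  for(j in seq_len(ncol(dups))){
    # identical() is also safe when the columns contain NA values
    if(identical(nondups[[i]], dups[[j]])) names(nondups)[i] <- paste(names(nondups)[i], names(dups)[j], sep = "+")
  }
}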
return(list(df1 = nondups, df2 = dups))
}
dupcols(df = testframe)
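For comparison, the same renaming can be sketched in base R without the digest package. This is only a sketch: merge_dup_names() is a hypothetical helper name, and it assumes that "duplicate" means identical() returns TRUE for the two columns (same type, values and attributes).

```r
# Sketch: drop duplicated columns and join their names with "+".
# merge_dup_names() is a hypothetical helper, not part of any package.
merge_dup_names <- function(df, sep = "+") {
  cols <- as.list(df)
  keep <- !duplicated(cols)  # TRUE for the first copy of each column
  merged <- vapply(cols[keep], function(col) {
    # names of every column that is identical to this kept column
    same <- vapply(cols, identical, logical(1), col)
    paste(names(cols)[same], collapse = sep)
  }, character(1), USE.NAMES = FALSE)
  out <- df[keep]
  names(out) <- merged
  out
}

df <- data.frame(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 4, 5, 6),
                 test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 5, 6))
res <- merge_dup_names(df)
names(res)
```

Every column is compared against each kept column, so on very wide data frames the digest-based function above may still be the better choice.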
Edited: This section is new.
Example 3: On a large data frame
#Creating a 1500 column by 15000 row data frame
dat <- do.call(data.frame, replicate(1500, rep(FALSE, 15000), simplify=FALSE))
names(dat) <- 1:1500
#Fill the data frame with LETTERS across the rows
#This part may take a while. Took my PC about 23 minutes.
start <- Sys.time()
fill <- rep(LETTERS, times = ceiling((15000*1500)/26))
j <- 0
for(i in 1:nrow(dat)){
dat[i,] <- fill[(1+j):(1500+j)]
j <- j + 1500
}
difftime(Sys.time(), start, units = "mins")
#Run the function on the created data set
#This took about 4 minutes to complete on my PC.
start <- Sys.time()
result <- dupcols(df = dat)
difftime(Sys.time(), start, units = "mins")
names(result$df1)
ncol(result$df1)
ncol(result$df2)
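The row-by-row fill in Example 3 is the slow part of building the test data. A possible shortcut, sketched here under the assumption that only the final contents matter, is to lay the letters out in a matrix filled by row and convert it once. make_letters_df() is a hypothetical helper name, and I have not timed it at the full 15000 x 1500 size.

```r
# Sketch: build the LETTERS test data without a row-by-row loop.
# make_letters_df() is a hypothetical helper used only for illustration.
make_letters_df <- function(n_rows, n_cols) {
  fill <- rep_len(LETTERS, n_rows * n_cols)
  # byrow = TRUE reproduces the "fill across the rows" order of the loop
  m <- matrix(fill, nrow = n_rows, ncol = n_cols, byrow = TRUE)
  dat <- as.data.frame(m, stringsAsFactors = FALSE)
  names(dat) <- as.character(seq_len(n_cols))
  dat
}

# Small demonstration: 3 rows by 30 columns
small <- make_letters_df(3, 30)
small[[1]]
```

For the full-size data, the call would be make_letters_df(15000, 1500).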
It's not completely automated, but the output of the loop identifies pairs of duplicate columns. You then have to remove one column of each pair and rename the remaining one based on which columns were duplicates.
df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4,
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4,
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA,
-6L), class = "data.frame")
for(i in 1:(ncol(df)-1)){
  # start at i + 1 so each pair is compared, and reported, only once
  for(j in (i+1):ncol(df)){
    if(identical(df[[i]], df[[j]])) print(paste(i, j, sep = " + "))
  }
}
new <- df[,-4]
names(new)[1] <- paste(names(df[1]), names(df[4]), sep = "+")
new
Related
I have a large data frame in R with over 200 participants who have answered 152 questions. Now, I want to insert a column based on a conditional query after each "Answer" column. As an example, I have the following data frame:
data <- data.frame(Participants = 1:5,
Answer1 = c(4, 6, 2, 2, 3),
Answer2 = c(5, 1, 3, 5, 4))
I now want to insert a new conditional column of "Confidence" after each "Answer" column. For the column of "Answer1", the query would look like this:
data$Confidence1 <- ifelse(data$Answer1 == 1 | data$Answer1 == 6, 2, ifelse(data$Answer1 == 2 | data$Answer1 == 5, 1, 0))
In the end, I want the data frame to look like this:
data <- data.frame(Participants = 1:5,
Answer1 = c(4, 6, 2, 2, 3),
Confidence1 = c(0, 2, 1, 1, 0),
Answer2 = c(5, 1, 3, 5, 4),
Confidence2 = c(1, 2, 0, 1, 0))
Does anyone have an idea on how to achieve this for all "Answer" columns at once? Thanks!
Here is one solution using the modify() function from the purrr package, combined with the column-interleaving approach from @r2evans's answer to "merge two data table into one, with alternating columns in R".
library(purrr)
# create data, adding one more answer to your example
data <- data.frame(Participants = 1:5,
Answer1 = c(4, 6, 2, 2, 3),
Answer2 = c(5, 1, 3, 5, 4),
Answer3 = c(3, 2, 3, 4, 5))
# make a new df containing only the answer columns from data
answers <- data[,2:4]
# make confidence df and give it correct names
conf <- modify(answers, function(x) ifelse(x == 1 | x == 6, 2, ifelse(x == 2 | x == 5, 1, 0)))
names(conf) <- paste0("Confidence",1:ncol(answers))
# set order
neworder <- order(c(2*(seq_along(answers) - 1) + 1,
2*seq_along(conf)))
# put it together
cbind(Participants = data[,1], cbind(answers, conf)[,neworder])
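A base R variant of the same idea is sketched below. It assumes that every answer column name starts with "Answer" and that Participants is the only other column; score() is a hypothetical helper name that wraps the nested ifelse() from the question.

```r
data <- data.frame(Participants = 1:5,
                   Answer1 = c(4, 6, 2, 2, 3),
                   Answer2 = c(5, 1, 3, 5, 4))

# find the answer columns by name instead of by position
answer_cols <- grep("^Answer", names(data), value = TRUE)

# score() is a hypothetical helper: 1 or 6 -> 2, 2 or 5 -> 1, anything else -> 0
score <- function(x) ifelse(x %in% c(1, 6), 2, ifelse(x %in% c(2, 5), 1, 0))

conf <- as.data.frame(lapply(data[answer_cols], score))
names(conf) <- sub("^Answer", "Confidence", answer_cols)

# rbind() stacks the two name vectors, as.vector() reads them column by column,
# which yields Answer1, Confidence1, Answer2, Confidence2, ...
interleaved <- as.vector(rbind(answer_cols, names(conf)))
result <- cbind(data, conf)[c("Participants", interleaved)]
result
```

Selecting the answer columns by name means nothing has to change when more Answer columns are added.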
Is it possible to do something like this in R (assuming both df1 and df2 have the same number of rows)?
if (df1$var1 = 8) df2$var1 = 1.
if (df1$var2 = 9) df2$var2 = 1.
This can be done in two lines with base R's ifelse():
df1 <- data.frame(var1 = c(1:10), var2 = c(1:10))
df2 <- data.frame(var1 = c(1:10), var2 = c(1:10))
df2$var1 <- ifelse(df1$var1 == 8, 1,df2$var1)
df2$var2 <- ifelse(df1$var2 == 9, 1,df2$var2)
Here is one simple option in base R, where we replicate the values 8, 9 to make the lengths same and compare with the subset of columns of 'df1', resulting in a logical matrix. Subset the 'df2' and assign those columns to 1
nm1 <- c('var1', 'var2')
df2[nm1][df1[nm1] == c(8, 9)[col(df1[nm1])]] <- 1
df2
# var1 var2 var3
#1 5 1 1
#2 3 1 2
#3 1 3 3
#4 1 4 4
#5 4 2 5
Or this can be done in two steps
df2$var1[df1$var1 == 8] <- 1
df2$var2[df1$var2 == 9] <- 1
Or using Map
df2[nm1] <- Map(function(x, y, z) replace(x, y == z, 1),
df2[nm1], df1[nm1], c(8, 9))
An if/else loop also works, but if is not vectorized, i.e. it expects a condition of length 1. So the comparison has to be made element by element in a loop, which is inefficient in R:
vals <- c(8, 9)
for(i in seq_len(nrow(df1))) {
for(j in seq_along(nm1)) {
if(df1[[nm1[j]]][i] == vals[j]) df2[[nm1[j]]][i] <- 1
}
}
data
df1 <- data.frame(var1 = c(1, 3, 8, 5, 2), var2 = c(9, 3, 1, 8, 4),
var3 = 1:5)
df2 <- data.frame(var1 = c(5, 3, 2, 1, 4), var2 = c(3, 1, 3, 4, 2),
var3 = 1:5)
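To make the logical-matrix idea concrete, here is a small sketch on the data above that keeps the target values in a named vector, so each column is paired with its own value by name. The names targets, cols and mask are only illustrative, and the sketch assumes the compared columns contain no NA values.

```r
df1 <- data.frame(var1 = c(1, 3, 8, 5, 2), var2 = c(9, 3, 1, 8, 4),
                  var3 = 1:5)
df2 <- data.frame(var1 = c(5, 3, 2, 1, 4), var2 = c(3, 1, 3, 4, 2),
                  var3 = 1:5)

# value to look for in each column of df1 (illustrative name)
targets <- c(var1 = 8, var2 = 9)
cols <- names(targets)

# one logical column per compared variable: TRUE where df1 equals its target
mask <- mapply(function(column, target) column == target, df1[cols], targets)

# overwrite the matching cells of df2; var3 is left untouched
df2[cols][mask] <- 1
df2
```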
I have a dataset with hundreds of rows structured like this
User Date Value1 Value2
A 2012-01-01 4 3
A 2012-01-02 5 7
A 2012-01-03 6 1
A 2012-01-04 7 4
B 2012-01-01 2 4
B 2012-01-02 3 2
B 2012-01-03 4 9
B 2012-01-04 5 3
As the panel data has two indices (User = k, Date = t), I am struggling to run a regression in R where the dependent variable (Value1) is lagged only on the time index. The regression should be performed as follows:
Value1(k,t+1) ~ Value2(k,t)
or
Value1(k,t) ~ Value2(k,t-1)
Any suggestions?
For every user, you can do:
> df <- data.frame(User = c(rep("A", 4), rep("B", 4)),
+ Date = rep(seq.Date(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
+ Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
+ Value2 = c(3, 7, 1, 4, 4, 2, 9, 3))
>
> df_A <- df[df$User == "A", c("Value1", "Value2")]
> ts_A <- ts(df_A, start = c(2012, 1, 1), frequency = 365)
> ts_A <- ts.intersect(ts_A, lag(ts_A, -1))
> colnames(ts_A) <- c("Value1", "Value2", "Value1_t_1", "Value2_t_1")
>
> lm(Value1 ~ Value2_t_1, ts_A)
Call:
lm(formula = Value1 ~ Value2_t_1, data = ts_A)
Coefficients:
(Intercept) Value2_t_1
6.3929 -0.1071
>
Hope it helps.
Here's a solution using the dplyr package. You may notice that in the code below I explicitly reference the lag function from dplyr as opposed to base R (stats). This is because the lag function from dplyr does not require a time series input.
I would also note that the two formulas you list may produce different regression results as you will be running them over different sets of data i.e.
Value1(k,t+1) ~ Value2(k,t) : run on the time period of 2012-01-01 to 2012-01-03
Value1(k,t) ~ Value2(k,t-1) : run on the time period of 2012-01-02 to 2012-01-04
library("tidyverse")
df <- data.frame(User = c(rep("A", 4), rep("B", 4)),
Date = rep(seq.Date(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
Value2 = c(3, 7, 1, 4, 4, 2, 9, 3))
df2 <- df %>% arrange(User,Date) %>%
group_by(User) %>%
mutate(lag_v2 = dplyr::lag(Value2),
lead_v1 = dplyr::lead(Value1))
df3<-df2[!is.na(df2$lag_v2),]
df4<-df2[!is.na(df2$lead_v1),]
summary(lm(Value1~lag_v2,data=df3))
summary(lm(lead_v1~Value2,data=df4))
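If you prefer to stay in base R, the within-user lag can be sketched with ave(). This assumes the rows can be sorted by User and Date and that each user has one row per date with no gaps; lag_v2 is just an illustrative column name.

```r
df <- data.frame(User = c(rep("A", 4), rep("B", 4)),
                 Date = rep(seq.Date(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
                 Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
                 Value2 = c(3, 7, 1, 4, 4, 2, 9, 3))

# sort so that "previous row" means "previous date" within each user
df <- df[order(df$User, df$Date), ]

# shift Value2 down by one row inside each user; the first row of a user gets NA
df$lag_v2 <- ave(df$Value2, df$User, FUN = function(x) c(NA, head(x, -1)))

# lm() drops the NA rows by default, so this pools t = 2..4 over both users
fit <- lm(Value1 ~ lag_v2, data = df)
summary(fit)
```

Because the shift happens inside each user, the last value of user A never leaks into the first row of user B.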
I have a list of lists containing multiple data frames. I would like to transpose the data frames and leave the lists structured as is.
The data is setup in this format (from:John McDonnell):
parent <- list(
a = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))
),
b = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))
)
)
This works when a single list of data frames is used, but not for a list of lists:
a_tran <- lapply(a, function(x) {
t(x)
})
Any thoughts on how to modify?
You could use modify_depth from purrr
library(purrr)
modify_depth(.x = parent, .depth = 2, .f = ~ as.data.frame(t(.)))
#$a
#$a$foo
# V1 V2 V3
#first 1 2 3
#second 4 5 6
#$a$bar
# V1 V2 V3
#first 1 2 3
#second 4 5 6
#$a$puppy
# V1 V2 V3
#first 1 2 3
#second 4 5 6
#$b
# ...
A base R option that @hrbrmstr initially posted in a comment would be
lapply(parent, function(x) lapply(x, function(y) as.data.frame(t(y))))
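Both versions above assume the data frames sit exactly two levels deep. If the nesting depth is not known in advance, a small recursive helper is one way to sketch it; transpose_nested() is a hypothetical name, and it assumes every leaf of the list is a data frame.

```r
# Sketch: transpose every data frame in a nested list, whatever its depth.
# transpose_nested() is a hypothetical helper, not a purrr or base function.
transpose_nested <- function(x) {
  if (is.data.frame(x)) {
    as.data.frame(t(x))           # leaf: transpose and convert back
  } else {
    lapply(x, transpose_nested)   # inner list: recurse, keeping the names
  }
}

parent <- list(
  a = list(
    foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
    bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))
  ),
  b = list(
    puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))
  )
)
res <- transpose_nested(parent)
res$a$foo
```

The data frame check comes first because a data frame is itself a list, so recursing into it would split it into columns.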
I'm new to R programming and need to achieve the desired output below. Can you please help me?
dataframe:
ID Name
1 null
2 list(A = 10, B = 20)
2 list(G = 4, U = 2)
3 null
3 null
4 list(A = 7, B = 10)
Desired Output will be,
ID Measure Measure.A Measure.B
1 null null null
2 list(A = 10, B = 20) 10 20
2 list(A = 4, B = 2) 4 2
3 null null null
3 null null null
4 list(A = 7, B = 10) 7 10
It is better to have NA instead of NULL elements in a data.frame. Loop through the 'Name' column (assuming it is a list with nested list elements), replace the NULL (assuming it is real NULL and not a character string "null") with NA and rbind the elements using do.call. Assign the output to create two new columns in 'df1'
df1[c("Measure.A", "Measure.B")] <- unlist(do.call(rbind,
lapply(df1$Name, function(x) replace(x, is.null(x), NA))))
names(df1)[2] <- "Measure"
data
df1 <- structure(list(ID = c(1, 2, 2, 3, 3, 4), Name = structure(list(
NULL, structure(list(A = 10, B = 20), .Names = c("A", "B"
)), structure(list(G = 4, U = 2), .Names = c("G", "U")),
NULL, NULL, structure(list(A = 7, B = 10), .Names = c("A",
"B"))), class = "AsIs")), .Names = c("ID", "Name"), row.names = c(NA,
-6L), class = "data.frame")
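An alternative sketch that spells out the NULL handling step by step is shown below. It assumes that 'Name' is a list column, that missing entries are real NULL values, and that every non-NULL entry holds exactly two numeric values, which are taken by position regardless of their names (A/B or G/U).

```r
df1 <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
                  Name = I(list(NULL, list(A = 10, B = 20), list(G = 4, U = 2),
                                NULL, NULL, list(A = 7, B = 10))))

# one row per list element: c(NA, NA) for NULL, otherwise the two values
measure <- t(vapply(df1$Name, function(x) {
  if (is.null(x)) c(NA_real_, NA_real_) else as.numeric(unlist(x))
}, numeric(2)))

df1$Measure.A <- measure[, 1]
df1$Measure.B <- measure[, 2]
names(df1)[2] <- "Measure"
df1[c("ID", "Measure.A", "Measure.B")]
```

vapply() with numeric(2) stops with an error if any entry does not have exactly two values, which makes a violated assumption visible instead of silently recycling.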