Joining duplicate columns in single dataframe [duplicate] - r

This question already has answers here:
Identifying duplicate columns in a dataframe
(10 answers)
Closed 3 years ago.
I have a dataframe where each column has a unique name, but the content of several columns is identical. The columns with identical content are all factor variables and they end in the same way (e.g. .x or .y). My goal is to join all columns with the same ending (.x or .y) into a single column.
Most solutions I have encountered in this regard combine multiple dataframes, but I have not found a solution yet that does this within a single dataframe. I am providing some example script to illustrate what my dataframe looks like at the moment and the desired output.
# generate some data
dv1 = rnorm(6)
dv2 = rnorm(6)
dv3 = rnorm(6)
# current dataframe
DF <- data.frame(dv1,
iv1.x = sort(rep(letters[1:2], 3)),
iv1.y = as.factor(c(1:6)),
dv2,
iv2.x = sort(rep(letters[1:2], 3)),
iv2.y = as.factor(c(1:6)),
dv3,
iv3.x = sort(rep(letters[1:2], 3)),
iv3.y = as.factor(c(1:6))
)
# desired dataframe
DF.cbmd <- data.frame(dv1,
dv2,
dv3,
iv1.x = sort(rep(letters[1:2], 3)),
iv1.y = as.factor(c(1:6))
)

If they are truly duplicate columns, there is nothing to merge; you can simply drop the duplicates:
dfUnique <- DF[!duplicated(as.list(DF))]

Your data frame looks like the result of a merge. The ideal fix would be to handle this in the previous step (the merge itself). Failing that, another idea is to strip everything up to the last . from the column names and keep only the first column for each remaining name, i.e.
DF[!duplicated(gsub('.*\\.', '', names(DF)))]
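As a minimal sketch of the two suggestions above, the following runs both on a smaller version of the question's example and then moves the dv columns to the front to match the desired layout. The names by_content, by_suffix, is_dv and dv_first are illustrative only, and the reordering step assumes the dependent variables are exactly the columns whose names start with dv.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
DF <- data.frame(
  dv1   = rnorm(6),
  iv1.x = sort(rep(letters[1:2], 3)),
  iv1.y = as.factor(1:6),
  dv2   = rnorm(6),
  iv2.x = sort(rep(letters[1:2], 3)),
  iv2.y = as.factor(1:6)
)

# Approach 1: drop columns whose content duplicates an earlier column
by_content <- DF[!duplicated(as.list(DF))]

# Approach 2: drop columns whose suffix (text after the last ".") repeats
by_suffix <- DF[!duplicated(sub(".*\\.", "", names(DF)))]

# Optional: put the dv columns first (assumes they are named dv...)
is_dv <- grepl("^dv", names(by_content))
dv_first <- by_content[c(which(is_dv), which(!is_dv))]
names(dv_first)
```

The first approach compares the actual content, while the second trusts the naming convention, so it would also drop a column with a repeated suffix even if its content differed.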

Related

Mutate a dataframe by a vector which should match variable names

I have a data frame with a column of years (year_vector) and several columns, one per year, that contain the GDP-per-head values of different countries at that point in time. I want to mutate this data frame to add a variable that, for each row, contains only the value from the year column named by that row's year_vector.
My data.frame looks like this :
library(dplyr)  # provides tibble() and %>%
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
'year_vector' = floor(sample(c(1940,1950,1960),8,replace=TRUE)),
'1940' = runif(8,15000,18000),
'1950' = runif(8,15000,18000),
'1960' = runif(8,15000,18000)
)
How can I mutate this data frame as explained above, for example to add a variable gdp_head?
EDIT: The output should look like this:
set.seed(123)
dataset <- tibble('country' = c('Austria','Austria','Austria','Germany','Germany','Sweden','Sweden','Sweden'),
'year_vector' = floor(sample(c(1940,1950,1960),8,replace=T)),
'1940' = runif(8,15000,18000),
'1950' = runif(8,15000,18000),
'1960' = runif(8,15000,18000)) %>%
mutate(gdp_head =c(.$'1940'[1],.$'1940'[2],.$'1960'[3],
.$'1950'[4],.$'1940'[5],.$'1960'[6],
.$'1960'[7],.$'1950'[8] ))
Here is one approach:
First, since you are going to compare the year_vector column with column names (which will be character), you can convert year_vector to character as well:
dataset$year_vector <- as.character(dataset$year_vector)
Your dataset is currently a tibble, but if you convert it to a plain data.frame you can subset it with a two-column [row, column] matrix and add the matched results as gdp_head:
dataset <- as.data.frame(dataset)
dataset$gdp_head <- as.numeric(dataset[cbind(1:nrow(dataset), match(dataset$year_vector, names(dataset)))])
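One caveat, offered as a sketch rather than a correction of the approach above: matrix-indexing a data frame that mixes character and numeric columns goes through as.matrix(), which converts the numeric columns to formatted strings, so the round trip through as.numeric() can lose digits. Restricting the lookup to the year columns keeps everything numeric. The names year_cols, year_mat and col_idx are illustrative, and the data are rebuilt here as a plain data.frame so the block runs on its own.

```r
set.seed(123)
dataset <- data.frame(
  country = c("Austria", "Austria", "Austria", "Germany",
              "Germany", "Sweden", "Sweden", "Sweden"),
  year_vector = sample(c(1940, 1950, 1960), 8, replace = TRUE),
  `1940` = runif(8, 15000, 18000),
  `1950` = runif(8, 15000, 18000),
  `1960` = runif(8, 15000, 18000),
  check.names = FALSE  # keep "1940" etc. as column names
)

year_cols <- c("1940", "1950", "1960")

# Numeric matrix of the year columns only (no character columns involved)
year_mat <- as.matrix(dataset[year_cols])

# For each row, the position of its year among the year columns
col_idx <- match(as.character(dataset$year_vector), year_cols)

# Pick one cell per row with a two-column (row, column) index matrix
dataset$gdp_head <- year_mat[cbind(seq_len(nrow(dataset)), col_idx)]
```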
I came up with the following solution, which works as well:
dataset %>%
do(.,mutate(.,gdp_head = pmap(list(1:nrow(.), year_vector),
function(x,y) .[x,(y-1901+16)]) %>%
unlist() ))
In this solution I turn each year into a column index: I subtract the first year from year_vector and add the position of the first year column. In my actual data the year variables start in 1901, which corresponds to column index 16.

apply a function to some columns of a data frame, while storing the result in the original data frame [duplicate]

This question already has answers here:
Coerce multiple columns to factors at once
(11 answers)
Closed 3 years ago.
I have a data frame in which I would like to convert some of the columns to factors (at the moment they are numeric).
For example:
dd = tibble::tibble( x = c(0, 0, 0, 1, 1, 1), y = c(1,2,3,4,5,6))  # data_frame() is deprecated in favour of tibble()
I would like to make only the first column a factor:
lapply(dd[,1], as.factor)
But the result is a list (of a factor), and is not saved back to the original data frame.
Is there a way to achieve this?
We can use
library(dplyr)
dd <- dd %>%
mutate(x = factor(x))
Or for multiple columns
nm1 <- names(dd)[1:2]
dd <- dd %>%
mutate_at(vars(nm1), factor)
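In the OP's code, the issue is that lapply loops over its input and always returns a list, and that the result is never assigned back to dd. For a plain data.frame, where dd[,1] returns a vector, we just need
dd[,1] <- factor(dd[,1])
Or, which also works when dd is a tibble (where dd[,1] returns a one-column tibble rather than a vector)
dd[[1]] <- factor(dd[[1]])
NOTE: For a single column, we don't need lapply at all.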
If we want to apply to multiple columns
dd[nm1] <- lapply(dd[nm1], factor)
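For reference, a hedged sketch of the same multi-column conversion using across(), assuming dplyr 1.0.0 or later is installed; mutate_at() still works but has been superseded. The names dd_across and dd_base are illustrative.

```r
library(dplyr)

dd <- tibble(x = c(0, 0, 0, 1, 1, 1), y = c(1, 2, 3, 4, 5, 6))
nm1 <- c("x", "y")

# dplyr: convert every column named in nm1 to a factor
dd_across <- dd %>%
  mutate(across(all_of(nm1), factor))

# base R equivalent: lapply over the selected columns and assign back
dd_base <- dd
dd_base[nm1] <- lapply(dd_base[nm1], factor)
```

all_of() makes it explicit that nm1 is an external character vector of column names rather than a column of dd.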

R replace identical column character items with increasing number

I have a data frame with 60000 obs. of 4 variables in the following format:
I need to replace each distinct character item in the first column with a number, so that identical items get the same number and the numbers increase in order of first appearance. So "101-startups" is 1, "10i10-aps" is 2, "10x" is 3, all "10x-fund-lp" are 4, and so on. The same for the second column.
How do I achieve this?
If I'm understanding your question correctly, all you need to do is something like:
my_data$col1 <- as.integer(factor(my_data$col1, levels = unique(my_data$col1)))
my_data$col2 <- as.integer(factor(my_data$col2, levels = unique(my_data$col2)))
It is probably a good idea to read up on factors.
Try building a separate dataframe from the unique entries of that column, then use the row names (which will be consecutive integers). If your dataframe is df and that first column is v1, something like
x = data.frame(v1 = unique(df$v1))
x$numbers = as.integer(row.names(x))  # row names are character, so convert them
Then merge the lookup table back onto the original data
final.df = merge(x, df, by = "v1")
and then use something like dplyr to select, drop, or rearrange columns.
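A small self-contained sketch of the numbering idea, using made-up data because the original table is not shown. The data frame my_data and its columns col1 and col2 are hypothetical, as are the values in col2. It also shows match() against the unique values, which gives the same codes without building a factor.

```r
# Hypothetical stand-in for the 60000-row data frame
my_data <- data.frame(
  col1 = c("101-startups", "10i10-aps", "10x", "10x-fund-lp", "10x-fund-lp"),
  col2 = c("b", "a", "b", "c", "a"),
  stringsAsFactors = FALSE
)

# Codes in order of first appearance, via a factor with explicit levels
ids_factor <- as.integer(factor(my_data$col1, levels = unique(my_data$col1)))

# Same codes: position of each value among the unique values
ids_match <- match(my_data$col1, unique(my_data$col1))

ids_factor  # 1 2 3 4 4
```

Setting levels = unique(...) matters: without it, factor() sorts the levels alphabetically, so the numbering would follow sort order rather than order of appearance.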

Error changing column names of data frames in a list-R [closed]

I have a list (list1) in which each element is a data frame consisting of one column.
All of the column names are the same, "x". I want to change the column names to "x1", "x2", ..., "xn".
I use the code below:
lapply(list1, function(x) setNames(x, "x",paste("x",1:seq_along(list1))))
However, this code does not work. Why not? I would be very glad for any help. Thanks a lot.
@David Arenburg, I edited the code as below (10 is the number of elements in list1):
lapply(list1, function(z) setNames(z,paste0("x",1:10)))
This code does not give any error, but it also does not change the column names. The column names are still "x".
I then edited the code as below; however, it still doesn't work.
for(i in 1:10)
{
list2[[i]]<-setNames(data.frame(list1[[i]])[,1], paste0("x",1:10)[i])
}
I removed seq_along for now. I will work on it after getting the desired result.
Each element of list1 is a data frame, and each data frame has only one column.
If you want to change the name of each column in multiple data frames that make a list, you should do the following:
# Artificial list with each data frame containing columns with values from 1 to 3
list1 = list(data.frame(x = 1:3), data.frame(x = 1:3), data.frame(x = 1:3),
data.frame(x = 1:3), data.frame(x = 1:3), data.frame(x = 1:3),
data.frame(x = 1:3), data.frame(x = 1:3), data.frame(x = 1:3),
data.frame(x = 1:3))
# Assign a distinct column name to each individual data frame
for (i in seq_along(list1)) {
colnames(list1[[i]]) <- paste0("x", i)
}
I created a list of data frames, each containing the same single column x. Because each data frame has only one column, passing all ten names to setNames() for every data frame does not help. Instead, you paste x together with a single index (from 1 to the length of your list) and assign that one name to the corresponding data frame.
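The same idea can be sketched without an explicit loop by pairing each data frame with its index using Map(). This is one possible alternative under the stated setup (one-column data frames), and list2 is just an illustrative name for the result.

```r
# Ten one-column data frames, all with the column name "x"
list1 <- replicate(10, data.frame(x = 1:3), simplify = FALSE)

# Walk over the data frames and their positions in parallel,
# giving the i-th data frame the single column name "x<i>"
list2 <- Map(function(df, i) setNames(df, paste0("x", i)),
             list1, seq_along(list1))

names(list2[[3]])  # "x3"
```

Note that Map() and lapply() return a new list, so the result has to be assigned; calling them without assignment leaves list1 unchanged.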

Pad data frame with missing dates in a series [duplicate]

This question already has an answer here:
r - time series padding with NA
(1 answer)
Closed 8 years ago.
I've aggregated a data frame of rows representing events into another data frame of daily counts using aggregate(). The resultant frame is sorted by date, but it's missing days with zero counts, and I want to fill those days in to get a continuous daily series. The count frame looks something like this:
agg <- data.frame(
date = as.Date(c("2013-04-02", "2013-04-04", "2013-04-07", "2013-04-08")),
count = c(4, 2, 6, 1))
The way I previously solved this was by iterating through the frame to find non-continuous days, then rbinding subsets of the frame with an empty one. But this is an ugly solution, horrible to debug and painfully inefficient to boot. My thinking is that it would be better to generate a new data frame, populate it with the target date series...
target <- data.frame(
date = seq(from = as.Date("2013-04-01"), to = as.Date("2013-04-10"), by = "day"),
count = NA)
... and then somehow project counts from agg onto target using the matching dates. Does anyone know how I'd do this -- or have a better solution?
You're almost there. Just do:
merge(agg,target[-2],all.y=TRUE)
The [-2] subset drops the count column from target. Without it, merge would join on both date and count, and the NA counts in target would never match the counts in agg. Alternatively, you could do:
target <- data.frame(
date = seq(from = as.Date("2013-04-01"), to = as.Date("2013-04-10"), by = "day"))
merge(agg,target,all.y=TRUE)
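The merge leaves NA for the days that were missing from agg. Since the question asks for zero counts, one possible follow-up step is to replace those NA values with 0. The name padded is illustrative.

```r
agg <- data.frame(
  date = as.Date(c("2013-04-02", "2013-04-04", "2013-04-07", "2013-04-08")),
  count = c(4, 2, 6, 1))

target <- data.frame(
  date = seq(from = as.Date("2013-04-01"), to = as.Date("2013-04-10"), by = "day"))

# Right outer join: keep every date in target, matched counts from agg
padded <- merge(agg, target, by = "date", all.y = TRUE)

# Days absent from agg come back as NA; treat them as zero events
padded$count[is.na(padded$count)] <- 0

padded$count  # 0 4 0 2 0 0 6 1 0 0
```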
As another solution, how about this?
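library(dplyr)
# zero-count rows for every day in the target range
other <- data.frame(date = seq(as.Date("2013-04-01"), as.Date("2013-04-10"), by = "day"), count = 0)
# keep only the days that are missing from agg
other <- filter(other, !(date %in% agg$date))
join <- full_join(agg, other, by = c("date", "count")) %>% arrange(date)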
It's a little messy, but it does the trick.
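If the tidyr package is available, complete() can do the expansion and the zero-filling in one call. This is a sketch under that assumption, not part of the answers above, and padded is an illustrative name.

```r
library(tidyr)

agg <- data.frame(
  date = as.Date(c("2013-04-02", "2013-04-04", "2013-04-07", "2013-04-08")),
  count = c(4, 2, 6, 1))

# Expand date to the full daily range and fill the new rows' count with 0
padded <- complete(
  agg,
  date = seq(as.Date("2013-04-01"), as.Date("2013-04-10"), by = "day"),
  fill = list(count = 0)
)
```

Here the date range is hard-coded as in the question; seq(min(agg$date), max(agg$date), by = "day") would instead pad only between the first and last observed dates.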
