Renaming and merging columns based on an old/new name dataset - r

The title is confusing, so this is best explained by an example.
I have the following data:
df <- "Green.Apple Red.Apple Pears Orange Lemon Lime
1 3 5 4 4 0 5
2 3 0 2 7 2 11
3 2 7 8 0 3 1
4 0 6 3 5 6 0 "
df <-read.table(text=df,header=T)
I would like to rename the columns using a table of old/new names, and then merge the columns that end up with the same name. If a renamed column has the same name as another column, the two should be summed. I bring the names into the workspace:
names <- "Original New
1 Green.Apple Apple
2 Red.Apple Apple
3 Pears Pear
4 Orange Orange
5 Lemon Cirtus
6 Lime Cirtus"
#
names <-read.table(text=names,header=T)
I have tried various workarounds. For example, the names table will always have the same length as the set of columns, so one could simply rename the columns by position from a list, but this is not robust and could result in errors in the larger task I am trying to accomplish.
This is what I am looking for:
yay <- "Apple Pear Orange Cirtus
1 8 4 4 5
2 3 2 7 13
3 9 8 0 4
4 6 3 5 6"
Many thanks
Jim
(controversial: Also open to a Pandas alternative)

You could also do:
names(df) <- names$New[match(names(df), names$Original)]
t(rowsum(t(df), group = colnames(df), na.rm = T))
# > t(rowsum(t(df), group = colnames(df), na.rm = T))
# Apple Cirtus Orange Pear
# 1 8 5 4 4
# 2 3 13 7 2
# 3 9 4 0 8
# 4 6 6 5 3

Use match to look up each old name in names$Original and rename df with the corresponding new name. Then use split.default to split the columns by name and rowSums to sum the columns that share a name.
names(df) <- names$New[match(names(df), names$Original)]
sapply(split.default(df, names(df)), rowSums)
# Apple Cirtus Orange Pear
#1 8 5 4 4
#2 3 13 7 2
#3 9 4 0 8
#4 6 6 5 3
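Since the question worries about silent errors in a larger task, here is a minimal sketch that wraps the same match / split.default / rowSums steps in a function and stops if a column has no entry in the lookup table. The names rename_and_sum and name_map are made up for this illustration, and the lookup table is rebuilt with data.frame() rather than read.table(); treat it as one possible arrangement, not the only way to do it.

```r
# Hypothetical helper: rename columns via a lookup table, then sum the
# columns that end up sharing a name.
rename_and_sum <- function(data, name_map) {
  idx <- match(names(data), name_map$Original)
  # Stop early if any column has no entry in the lookup table.
  if (anyNA(idx)) {
    stop("No new name for: ", paste(names(data)[is.na(idx)], collapse = ", "))
  }
  new_names <- as.character(name_map$New[idx])
  # Split the columns by new name (in order of first appearance)
  # and sum each group row-wise.
  groups <- split.default(data, factor(new_names, levels = unique(new_names)))
  as.data.frame(lapply(groups, rowSums))
}

df <- read.table(header = TRUE, text = "Green.Apple Red.Apple Pears Orange Lemon Lime
1 3 5 4 4 0 5
2 3 0 2 7 2 11
3 2 7 8 0 3 1
4 0 6 3 5 6 0")

name_map <- data.frame(
  Original = c("Green.Apple", "Red.Apple", "Pears", "Orange", "Lemon", "Lime"),
  New      = c("Apple", "Apple", "Pear", "Orange", "Cirtus", "Cirtus")
)

result <- rename_and_sum(df, name_map)
result
```

Building the grouping factor with levels = unique(new_names) keeps the columns in the order Apple, Pear, Orange, Cirtus instead of sorting them alphabetically.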

Related

I would like to extract the columns of each element of a list in R

I'd like to extract the 3rd column (c) of each element in this list and store the result.
(I've put the data frames in a list in this example so that it looks like the long list of lists I have):
set.seed(59)
df<- data.frame(a=c(1,4,5,2),b=c(9,2,7,4),c=c(5,2,9,4))
df1<- data.frame(df,2*df)
df1<- list(df,2*df)
[[1]]
a b c
1 1 9 5
2 4 2 2
3 5 7 9
4 2 4 4
[[2]]
a b c
1 2 18 10
2 8 4 4
3 10 14 18
4 4 8 8
Seems fairly simple for just one element
> df1[[1]]["c"]
c
1 5
2 2
3 9
4 4
> df1["c"] # cries again
[[1]]
NULL
All I want to see is:
[[1]]
c
1 5
2 2
3 9
4 4
[[2]]
c
1 10
2 4
3 18
4 8
Thanks in advance
Use lapply:
data <- lapply(df1, function(x) x[, 'c', drop = FALSE])
data
#[[1]]
# c
#1 5
#2 2
#3 9
#4 4
#[[2]]
# c
#1 10
#2 4
#3 18
#4 8
When you select a single column from a data frame with [, j], R drops the result to the lowest possible dimension, which is a vector in this case. drop = FALSE is needed to keep it as a data frame.
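To see the dropping behaviour in isolation, here is a small sketch on the question's data frame; the variable names are only for illustration.

```r
df <- data.frame(a = c(1, 4, 5, 2), b = c(9, 2, 7, 4), c = c(5, 2, 9, 4))

# Default: a single column selected with [, j] is dropped to a plain vector.
as_vector <- df[, "c"]
is.data.frame(as_vector)   # FALSE

# drop = FALSE keeps the result as a one-column data frame.
as_frame <- df[, "c", drop = FALSE]
is.data.frame(as_frame)    # TRUE

# List-style indexing with a single argument also keeps the data frame.
also_frame <- df["c"]
is.data.frame(also_frame)  # TRUE
```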

Change the order of numerically named columns in r

If I have a dataframe like the one below which has numerical column names
example = data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3), `3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
Which looks like this:
1 2 3 4 5
1 3 5 1 2
8 2 2 2 5
3 3 5 3 7
9 3 4 4 8
And I want to arrange it so that the column names start with three and proceed through five and back to one, like this:
3 4 5 1 2
5 1 2 1 3
2 2 5 8 2
5 3 7 3 3
4 4 8 9 3
I know how to rearrange the position of a single column in a dataset, but I'm not sure how to do this with more than one column in this particular order.
We can index the columns by position, concatenating (c) two integer sequences (:) in the order we want
example[c(3:5, 1:2)]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
As the column names are all numeric, just convert to numeric and use that for ordering
v1 <- as.numeric(names(example))
example[c(v1[3:5], v1[1:2])]
Or simply do
example[c(names(example)[3:5], names(example)[1:2])]
Or another way is with head and tail
example[c(tail(names(example), 3), head(names(example), 2))]
data
example <- data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3),
`3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
R will not easily let you create columns with numbers as names. If you do manage to create such columns, you can use match to get the positions of the column names in the order you want.
example[match(c(3:5, 1:2), names(example))]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
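A short sketch of why numeric names are awkward, assuming the same example data: by default data.frame() rewrites them, and once they are kept, indexing by number and indexing by name stop meaning the same thing after a reorder.

```r
# With the default check.names = TRUE, numeric names are made syntactic.
checked <- data.frame(`1` = c(1, 8, 3, 9), `2` = c(3, 2, 3, 3))
names(checked)    # "X1" "X2"

# check.names = FALSE keeps the names exactly as given.
example <- data.frame(`1` = c(1, 8, 3, 9), `2` = c(3, 2, 3, 3),
                      `3` = c(5, 2, 5, 4), `4` = c(1, 2, 3, 4),
                      `5` = c(2, 5, 7, 8), check.names = FALSE)

# Column names are character, so match() coerces the numbers before comparing.
reordered <- example[match(c(3:5, 1:2), names(example))]
names(reordered)  # "3" "4" "5" "1" "2"

# After reordering, position 1 and name "1" refer to different columns.
by_position <- reordered[[1]]    # the column named "3"
by_name     <- reordered[["1"]]  # the column named "1"
```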

r: efficient way to merge data when there is common column and combine data when there is no common column

I have two data frames, data1 and data2, that share some duplicate columns. I am currently running a for loop in which each iteration merges one column of data1 with all the columns of data2. For example
data1:
1 1 3 4 4
2 5 2 4 2
2 2 8 8 0
data2
1 4 5 4 5
2 9 3 4 5
2 7 4 8 0
Columns 1 and 4 are duplicated in data1 and data2. In the first iteration, it merges
1
2
2
with data2
1 4 5 4 5
2 9 3 4 5
2 7 4 8 0
so the desired result is
1 4 5 4 5
2 9 3 4 5
2 7 4 8 0
Then it goes to the second column
1
5
2
and it merges with data2
1 4 5 4 5
2 9 3 4 5
2 7 4 8 0
The desired result will be
1 1 4 5 4 5
5 2 9 3 4 5
2 2 7 4 8 0
My idea is to use a combine or merge function, but neither of them achieves the desired output
# loop over the columns of data1; each iteration overwrites the previous result
for (i in seq_len(ncol(data1))) {
datam_merge <- merge(data1[i], data2)
}
Any suggestion is appreciated!
This should do the trick:
data3 <- dplyr::left_join(data2, data1)
head(data3)
The left_join() function works out which columns data2 has in common with data1, matches rows on those columns, and then adds only the columns of data1 that data2 does not already have.
I noticed that your "desired result" is dropping column 5 from data1. Was this intentional, or is your desired output a new dataframe that has all of the columns from data1 and data2 without any duplicates?
This is another approach that may be a more generalized solution:
data3 <- dplyr::inner_join(data1, data2)
inner_join() also joins on the shared columns, but it keeps only the rows whose key values occur in both data frames, whereas left_join() keeps every row of data2.
Let me know if this is what you were looking for!
Edit:
Here's my example:
data1 <- data.frame(c(1,2,2),c(1,5,2),c(3,2,8),c(4,4,8),c(4,2,0))
names(data1) <- c("A","B","C","D","E")
data2 <- data.frame(c(1,2,2),c(4,9,7),c(5,3,4),c(4,4,8),c(5,5,0))
names(data2) <- c("A","F","G","D","H")
## columns 'A' and 'D' are in common, but we only need one of each letter ('A' through 'E').
data3 <- dplyr::left_join(data2, data1)
head(data3)
A F G D H B C E
1 1 4 5 4 5 1 3 4
2 2 9 3 4 5 5 2 2
3 2 7 4 8 0 2 8 0
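If you would rather not depend on dplyr, a roughly equivalent base R sketch uses merge() with the shared columns spelled out as keys. This is a swapped-in technique rather than the answer's own code, and keys is just an illustrative variable name. Note that merge() puts the key columns first and sorts the rows by key, so the column order differs from the left_join() output.

```r
data1 <- data.frame(A = c(1, 2, 2), B = c(1, 5, 2), C = c(3, 2, 8),
                    D = c(4, 4, 8), E = c(4, 2, 0))
data2 <- data.frame(A = c(1, 2, 2), F = c(4, 9, 7), G = c(5, 3, 4),
                    D = c(4, 4, 8), H = c(5, 5, 0))

# Columns present in both data frames act as the join keys.
keys <- intersect(names(data2), names(data1))   # "A" "D"

# all.x = TRUE keeps every row of data2, like a left join.
data3 <- merge(data2, data1, by = keys, all.x = TRUE)
data3
```

Naming the keys explicitly makes it obvious which columns drive the match, which is easy to lose track of when the join picks them automatically.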

R, ggvis graphing from two data frames that both need to be grouped by

I'm creating a stacked bar graph with multiple horizontal lines running through it. This is done in a Shiny app. The user picks an option and depending on what it is, there could be either 2 or 3 horizontal lines.
Here is a minimal reproducible example:
df1 <- data.frame(a=as.factor(rep(1:10,2)),
b=sample(1:5,20, replace=T),
c=c(rep("apple",10), rep("banana",10)) )
df1 <- df1[order(df1$a, df1$c),]
df2 <- data.frame(a=as.factor(rep(1:10,2)),
i=c(rep(3,10),rep(4,10)),
j=c(rep("red",10), rep("green",10)) )
> df1
a b c
1 1 5 apple
11 1 2 banana
2 2 3 apple
12 2 3 banana
3 3 1 apple
13 3 2 banana
4 4 3 apple
14 4 1 banana
5 5 4 apple
15 5 3 banana
6 6 4 apple
16 6 2 banana
7 7 3 apple
17 7 4 banana
8 8 5 apple
18 8 1 banana
9 9 5 apple
19 9 2 banana
10 10 1 apple
20 10 3 banana
> df2
a i j
1 1 3 red
2 2 3 red
3 3 3 red
4 4 3 red
5 5 3 red
6 6 3 red
7 7 3 red
8 8 3 red
9 9 3 red
10 10 3 red
11 11 3 red
12 1 4 green
13 2 4 green
14 3 4 green
15 4 4 green
16 5 4 green
17 6 4 green
18 7 4 green
19 8 4 green
20 9 4 green
21 10 4 green
22 11 4 green
ggvis(data=df1, x=~a, y=~b) %>%
group_by(c) %>%
layer_bars(fill=~c) %>%
layer_paths(data=df2, x=~a, y=~i, strokeWidth:=2)
which gives me the following graph (it'll look different each time because of sample() ).
But I don't want the inverse Z in the middle. What I want is two parallel lines that are grouped by df2$j. But I'm not sure how to go about that with two data frames in my ggvis.
The reason I have df2 in a long form is because the user could choose an option that would create more than 2 horizontal lines. I don't want to use if and else to control for that. In my actual code, df1 and df2 are both reactives.
Thank you for your help in advance.
You can give layer_paths a dataset grouped on your y variable so the horizontal lines will be drawn separately for each group.
To do this, you can use data = group_by(df2, i) instead of data = df2.
And your code and plot would look like:
ggvis(data=df1, x=~a, y=~b) %>%
group_by(c) %>%
layer_bars(fill=~c) %>%
layer_paths(data = group_by(df2, i), x = ~a, y = ~i, strokeWidth:=2)

Separate unique and duplicate entries in dataframe based off id

I have a dataframe with an id variable, which may be duplicated. I want to split this into two dataframes: one containing only the entries whose ids are duplicated, and the other containing only the entries whose ids are unique. What is the best way of doing this?
For example, say I had the data frame:
dataDF <- data.frame(id = c(1,1,2,3,4,4,5,6),
a = c(1,2,3,4,5,6,7,8),
b = c(8,7,6,5,4,3,2,1))
i.e. the following
id a b
1 1 1 8
2 1 2 7
3 2 3 6
4 3 4 5
5 4 5 4
6 4 6 3
7 5 7 2
8 6 8 1
I want to get the following dataframes:
id a b
1 1 1 8
2 1 2 7
5 4 5 4
6 4 6 3
and
id a b
3 2 3 6
4 3 4 5
7 5 7 2
8 6 8 1
I am currently doing this as follows
dupeIds <- unique(subset(dataDF, duplicated(dataDF$id))$id)
uniqueDF <- subset(dataDF, !id %in% dupeIds)
dupeDF <- subset(dataDF, id %in% dupeIds)
This seems to work, but it feels a bit off to subset three times. Is there a simpler way of doing this? Thanks
Use duplicated twice, once top down, and once bottom up, and then use split to get it all in a list, like this:
split(dataDF, duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE))
# $`FALSE`
# id a b
# 3 2 3 6
# 4 3 4 5
# 7 5 7 2
# 8 6 8 1
#
# $`TRUE`
# id a b
# 1 1 1 8
# 2 1 2 7
# 5 4 5 4
# 6 4 6 3
If you need to split this out into separate data.frames in your workspace (not sure why you would need to do that), assign names to the list items (e.g. names(mylist) <- c("nodupe", "dupe")) and then use list2env.
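A minimal sketch of that last step, assuming the list is called parts (a made-up name) and that you want the pieces in the global environment:

```r
dataDF <- data.frame(id = c(1, 1, 2, 3, 4, 4, 5, 6),
                     a  = c(1, 2, 3, 4, 5, 6, 7, 8),
                     b  = c(8, 7, 6, 5, 4, 3, 2, 1))

# TRUE for every row whose id occurs more than once.
is_dupe <- duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE)

# split() orders the groups FALSE then TRUE, so the names line up.
parts <- split(dataDF, is_dupe)
names(parts) <- c("nodupe", "dupe")

# Copy each list element into the global environment as its own object.
invisible(list2env(parts, envir = .GlobalEnv))
```

Be aware that the names assignment assumes both groups exist; if the data has no duplicated ids at all, split() returns a single element and assigning two names will fail.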
