Combine rows of data frame in R using colMeans?

I'm impressed by the number of "how to combine rows/columns" threads, but even more by the fact that none of these was particularly helpful or at least not applicable to my issue.
My data look like this:
MyData<-data.frame("id" = c("a","a","b"),
"value1_1990" = c(5,NA,1),
"value2_1990" = c(5,NA,2),
"value1_2000" = c(2,1,1),
"value2_2000" = c(2,1,2),
"value1_2010" = c(NA,9,1),
"value2_2010" = c(NA,9,2))
What I want to do is to combine the two rows where id=="a" for columns MyData[,(2:7)] using base R's colMeans.
What it looks like:
id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1 a 5 5 2 2 NA NA
2 a NA NA 1 1 9 9
3 b 1 2 1 2 1 2
What I need:
id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1 a 5 5 1.5 1.5 9 9
2 b 1 2 1 2 1 2
What I tried (among numerous other things):
MyData[nrow(MyData)+1, 2:7] = colMeans(MyData[which(MyData$id=="a"),(2:7)],na.rm=T) # to combine values from rows where id=="a"
MyData$id<-ifelse(is.na(MyData$id),"NewRow",MyData$id) # to replace "<NA>" in the id-column of the newly created row by "NewRow".
This works, except for the fact that...
...it turns all other existing id's into numeric values (and I don't want to let the second line of code -- the ifelse-statement -- touch any of the existing id's, which is why I wrote else==MyData$id).
...this is not particularly fancy code. Is there a one-line solution that does the trick? I saw other approaches using aggregate(), but they didn't work for me.

You can try using dplyr:
library(dplyr)
Possible solution:
# dplyr >= 1.0; older versions used summarise_all(funs(mean(., na.rm = TRUE)))
MyData %>% group_by(id) %>% summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
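Since the question asks for base R, the same grouping can also be sketched with aggregate(). This is only a sketch on the question's sample data; it uses the `by =` interface rather than the formula interface, because the formula interface drops rows containing NA by default, which may be why earlier aggregate() attempts failed. The name `combined` is made up for illustration.

```r
# Sample data from the question
MyData <- data.frame(id = c("a", "a", "b"),
                     value1_1990 = c(5, NA, 1),
                     value2_1990 = c(5, NA, 2),
                     value1_2000 = c(2, 1, 1),
                     value2_2000 = c(2, 1, 2),
                     value1_2010 = c(NA, 9, 1),
                     value2_2010 = c(NA, 9, 2))

# Average every value column within each id; na.rm = TRUE is passed on to mean()
combined <- aggregate(MyData[-1], by = list(id = MyData$id),
                      FUN = mean, na.rm = TRUE)
combined
```

For this data the result has one row per id, with 1.5 in the 2000 columns for id "a". If a group has only NA in some column, mean() with na.rm = TRUE returns NaN there.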

Related

Newbie working on Horse Racing Database using R

I'm new to the group and to R language.
I've written some code (below) that achieves the desired result.
However, I'm aware that I'm repeating the same lines of code, which would surely be more efficient to write as a for loop.
Also, there will be races with large numbers of horses so I really need to be able to run a for loop that runs through each horse.
ie. num_runners = NROW(my_new_data)
my_new_data holds data on horses previous races.
DaH is a numeric rating that is attached to each of a horse's previous runs with DaH1 being the most recent and DaH6 is six races back.
Code, a character, signifies the type of race that the horse competed in. ie. Flat, Fences.
I have played with using for loops, ie. for(i in 1:6) without success.
Since I am assigning to a new horse each time I would hope something such as the following would work:
horse(i) = c(my_new_data$DaH1[i],my_new_data$DaH2[i],my_new_data$DaH3[i],my_new_data$DaH4[i],my_new_data$DaH5[i],my_new_data$DaH6[i])
But I know that horse(i) is not allowed.
Would my best strategy be to pre-define a dataframe of size: 6 rows and 6 columns
and use 2 for loops to populate [row][column]? Something like:
final_data[i,j]
Here is the code I am presently using which creates the dataframe racetest:
horse1 = c(my_new_data$DaH1[1],my_new_data$DaH2[1],my_new_data$DaH3[1],my_new_data$DaH4[1],my_new_data$DaH5[1],my_new_data$DaH6[1])
horse2 = c(my_new_data$DaH1[2],my_new_data$DaH2[2],my_new_data$DaH3[2],my_new_data$DaH4[2],my_new_data$DaH5[2],my_new_data$DaH6[2])
horse3 = c(my_new_data$DaH1[3],my_new_data$DaH2[3],my_new_data$DaH3[3],my_new_data$DaH4[3],my_new_data$DaH5[3],my_new_data$DaH6[3])
horse4 = c(my_new_data$DaH1[4],my_new_data$DaH2[4],my_new_data$DaH3[4],my_new_data$DaH4[4],my_new_data$DaH5[4],my_new_data$DaH6[4])
horse5 = c(my_new_data$DaH1[5],my_new_data$DaH2[5],my_new_data$DaH3[5],my_new_data$DaH4[5],my_new_data$DaH5[5],my_new_data$DaH6[5])
horse6 = c(my_new_data$DaH1[6],my_new_data$DaH2[6],my_new_data$DaH3[6],my_new_data$DaH4[6],my_new_data$DaH5[6],my_new_data$DaH6[6])
horse1.code = c(my_new_data$Code1[1],my_new_data$Code2[1],my_new_data$Code3[1],my_new_data$Code4[1],my_new_data$Code5[1],my_new_data$Code6[1])
horse2.code = c(my_new_data$Code1[2],my_new_data$Code2[2],my_new_data$Code3[2],my_new_data$Code4[2],my_new_data$Code5[2],my_new_data$Code6[2])
horse3.code = c(my_new_data$Code1[3],my_new_data$Code2[3],my_new_data$Code3[3],my_new_data$Code4[3],my_new_data$Code5[3],my_new_data$Code6[3])
horse4.code = c(my_new_data$Code1[4],my_new_data$Code2[4],my_new_data$Code3[4],my_new_data$Code4[4],my_new_data$Code5[4],my_new_data$Code6[4])
horse5.code = c(my_new_data$Code1[5],my_new_data$Code2[5],my_new_data$Code3[5],my_new_data$Code4[5],my_new_data$Code5[5],my_new_data$Code6[5])
horse6.code = c(my_new_data$Code1[6],my_new_data$Code2[6],my_new_data$Code3[6],my_new_data$Code4[6],my_new_data$Code5[6],my_new_data$Code6[6])
racetest = data.frame(horse1,horse1.code,horse2,horse2.code, horse3, horse3.code,
horse4,horse4.code,horse5,horse5.code, horse6, horse6.code)
Thanks in advance for any help that can be offered!
Graham
Using loops in R is usually not the best approach. Still, I will give you something which might work.
There are two possible approaches I see here; I will address the simpler one.
If the columns are ordered such that columns 1:6 are DaH1 to DaH6 and columns 7:12 are Code1 to Code6, then:
library(magrittr)
temp<- cbind(my_new_data[,1:6] %>% t,
my_new_data[,7:12]%>% t)
Odd = seq(1,12,2)
my_new_data[ , Odd] = temp[,1:6]
my_new_data[ , -Odd] = temp[,7:12]
#cleanup
rm(temp,Odd)
my_new_data should now contain your desired output. Before you run this, make sure your data is backed up inside another object as this is untested code.
Actually we want to reshape the wide format of the data in a different wide format. But first let's look at your desired for loop approach, to understand what's going on.
Using a loop
For the loop we'll need two variables with sequences i and j.
## initialize matrix with dimnames
racetest <- matrix(NA, 3, 6,
dimnames=list(c("DaH1", "DaH2", "DaH3"),
c("horse1", "horse1.code", "horse2", "horse2.code",
"horse3", "horse3.code")))
## loop
for (i in 0:2) {
  for (j in 1:3) {
    ## row j = j-th previous run; columns j and j+3 hold DaH<j> and Code<j>
    racetest[j, 1:2+2*i] <- unlist(my_new_data[i+1, c(j, j+3)])
  }
}
# horse1 horse1.code horse2 horse2.code horse3 horse3.code
# DaH1 1 1 2 2 3 3
# DaH2 1 1 2 2 3 3
# DaH3 1 1 2 2 3 3
for loops are often discouraged in R because they can be slow and don't use the vectorized features of the language. They can also be tricky to program.
Transposing column sets
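We could also take a different approach. What we actually want is to transpose the DaH* and Code* column sets (identified with grep) and then interleave them by sorting the column names on their last character, which substring extracts when nchar is given as the start position. Note that sorting on a single trailing character only orders correctly for up to nine horses.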
rownames(my_new_data) <- paste0("horse.", seq(nrow(my_new_data)))
rr <- data.frame(DaH=t(my_new_data[, grep("DaH", names(my_new_data))]),
Code=t(my_new_data[, grep("Code", names(my_new_data))]))
rr <- rr[order(substring(names(rr), nchar(names(rr))))]
rr
# DaH.horse.1 Code.horse.1 DaH.horse.2 Code.horse.2 DaH.horse.3 Code.horse.3
# DaH1 1 1 2 2 3 3
# DaH2 1 1 2 2 3 3
# DaH3 1 1 2 2 3 3
Reshaping data
Last but not least, we actually want to reshape the data. For this we give the data set an ID variable.
my_new_data <- transform(my_new_data, horse=1:nrow(my_new_data))
At first, we reshape the data into "long" format, using the new ID variable horse and put the two varying column sets into a list.
rr1 <- reshape(my_new_data, idvar="horse", varying=list(1:3, 4:6), direction="long", sep="",
v.names=c("DaH", "Code"))
rr1
# horse time DaH Code
# 1.1 1 1 1 1
# 2.1 2 1 2 2
# 3.1 3 1 3 3
# 1.2 1 2 1 1
# 2.2 2 2 2 2
# 3.2 3 2 3 3
# 1.3 1 3 1 1
# 2.3 2 3 2 2
# 3.3 3 3 3 3
Then, in order to get the desired wide format, what we want is to swap idvar and timevar, where our new idvar is "time" and our new timevar is "horse".
reshape(rr1, timevar="horse", idvar="time", direction= "wide")
# time DaH.1 Code.1 DaH.2 Code.2 DaH.3 Code.3
# 1.1 1 1 1 2 2 3 3
# 1.2 2 1 1 2 2 3 3
# 1.3 3 1 1 2 2 3 3
Benchmark
The benchmark reveals that of these three approaches transposing of the matrices is fastest, while the 'for' loop is actually by far the slowest.
# Unit: microseconds
# expr min lq mean median uq max neval cld
# forloop 7191.038 7373.5890 8381.8036 7576.678 7980.4320 46677.324 100 c
# transpose 620.748 656.0845 707.7248 692.953 733.1365 944.773 100 a
# reshape 2791.710 2858.6830 3013.8372 2958.825 3118.4125 3871.960 100 b
Toy data:
my_new_data <- data.frame(DaH1=1:3, DaH2=1:3, DaH3=1:3, Code1=1:3, Code2=1:3, Code3=1:3)
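The toy data above has identical values in every column, so it cannot show whether horses and runs end up in the right places. The following sketch uses made-up, distinguishable values (numeric ratings and character race codes) and builds the horse1, horse1.code, ... layout from the question with lapply, which plays the role of the horse(i) assignment the asker was looking for. The values and the helper name `per_horse` are assumptions for illustration only.

```r
# Hypothetical data: 3 horses (rows), 2 previous runs each
my_new_data <- data.frame(
  DaH1  = c(10, 20, 30),
  DaH2  = c(11, 21, 31),
  Code1 = c("Flat", "Fences", "Flat"),
  Code2 = c("Flat", "Flat", "Fences"),
  stringsAsFactors = FALSE
)

dah_cols  <- grep("^DaH",  names(my_new_data), value = TRUE)
code_cols <- grep("^Code", names(my_new_data), value = TRUE)

# For each horse, build a two-column data frame: ratings and race codes,
# one row per previous run (most recent first)
per_horse <- lapply(seq_len(nrow(my_new_data)), function(i) {
  out <- data.frame(
    unlist(my_new_data[i, dah_cols],  use.names = FALSE),
    unlist(my_new_data[i, code_cols], use.names = FALSE),
    stringsAsFactors = FALSE
  )
  names(out) <- c(paste0("horse", i), paste0("horse", i, ".code"))
  out
})

# Bind the per-horse pairs side by side
racetest <- do.call(cbind, per_horse)
racetest
```

Because the ratings and codes are kept in separate columns, the ratings stay numeric and the codes stay character, and the code works for any number of runners.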

How to compare two variable columns with each other in R?

I'm new to R and need help! I have many variables including Response and RightResponse.
I need to compare those two columns, and create a new column that can show whether there is a match or miss between each of the value pairs.
Thanks.
Perhaps something like this?
library(magrittr)
library(dplyr)
res <- data.frame(Response=c(1,4,4,3,3,6,3),RightResponse=c(1,2,4,3,3,6,5))
res <- res %>% mutate("CorrectOrNot" = ifelse(Response == RightResponse, "Correct","Incorrect"))
res
Response RightResponse CorrectOrNot
1 1 1 Correct
2 4 2 Incorrect
3 4 4 Correct
4 3 3 Correct
5 3 3 Correct
6 6 6 Correct
7 3 5 Incorrect
Basically the mutate function has created a new column containing the results of a comparison between Response and RightResponse.
Hope this helps!
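The same comparison can be done without any packages. A minimal base R sketch on the sample data from the answer; the column name `Match` and the variable `n_correct` are made up for illustration.

```r
res <- data.frame(Response      = c(1, 4, 4, 3, 3, 6, 3),
                  RightResponse = c(1, 2, 4, 3, 3, 6, 5))

# Element-wise comparison gives a logical vector, one value per row
res$Match <- res$Response == res$RightResponse

# Optional text label derived from the logical column
res$CorrectOrNot <- ifelse(res$Match, "Correct", "Incorrect")

# Logical values sum as 0/1, so this counts the matching rows
n_correct <- sum(res$Match)
```

Keeping the logical column makes counting and filtering straightforward. If either column can contain NA, `==` returns NA for those rows, so they would need separate handling.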

How to conditionally combine data.frame object in the list in more elegant way?

I have a list of data.frames and want to combine some of them conditionally: first row-bind the second and third data.frames and drop the duplicated rows, then row-bind that result to the first data.frame. I currently do this with rbind, but my approach is not elegant. Can anyone help me improve it? How can I write a more general solution that can be used in functional programming and still gives the desired output?
reproducible example:
dfList <- list(
DF.1 = data.frame(red=c(1,2,3), blue=c(NA,1,2), green=c(1,1,2)),
DF.2 = data.frame(red=c(2,3,NA), blue=c(1,2,3), green=c(1,2,4)),
DF.3 = data.frame(red=c(2,3,NA,NA), blue=c(1,2,NA,3), green=c(1,2,3,4))
)
dummy way to do it:
rbind(dfList[[1L]], unique(rbind(dfList[[2L]], dfList[[3L]])))
My attempt is not elegant enough to use in functional programming. How can I do this more elegantly?
desired output :
red blue green
1 1 NA 1
2 2 1 1
3 3 2 2
11 2 1 1
21 3 2 2
31 NA 3 4
6 NA NA 3
How can I make my solution more elegant and efficient? Thanks in advance.
The best (easiest and fastest way) to do this is data.table::rbindlist.
It would work like this:
library(data.table)
dfList <- list(
DF.1 = data.table(red=c(1,2,3), blue=c(NA,1,2), green=c(1,1,2)),
DF.2 = data.table(red=c(2,3,NA), blue=c(1,2,3), green=c(1,2,4)),
DF.3 = data.table(red=c(2,3,NA,NA), blue=c(1,2,NA,3), green=c(1,2,3,4))
)
# part 1: list element 1
dt_1 <- dfList[[1]]
# part 2: all other list elements (in your case 2 and 3)
dt_2 <- unique(rbindlist(dfList[-1]))
# use rbindlist to bind the rows together
dt_all <- rbindlist(list(dt_1, dt_2))
Comment.
My solution is pretty close to your proposed solution. I think the "ugliness" comes from this being an edge case: you detach the first element and treat it differently from the rest. The best solution would probably be to step back, think about the underlying idea, and encode it in an additional variable in the datasets (i.e., one value for df1 and another for df2 and df3), which I would consider the R way.
Something along this thought would look like this:
myList2 <- list(
DF.1 = data.table(red=c(1,2,3), blue=c(NA,1,2), green=c(1,1,2), var = "df1"),
DF.2 = data.table(red=c(2,3,NA), blue=c(1,2,3), green=c(1,2,4), var = "other"),
DF.3 = data.table(red=c(2,3,NA,NA), blue=c(1,2,NA,3), green=c(1,2,3,4), var = "other")
)
dt <- rbindlist(myList2)
unique(dt)
# red blue green var
# 1: 1 NA 1 df1
# 2: 2 1 1 df1
# 3: 3 2 2 df1
# 4: 2 1 1 other
# 5: 3 2 2 other
# 6: NA 3 4 other
# 7: NA NA 3 other
A way of rbinding a list of data.frames with only base R is do.call(rbind, dfList) (see this question that also presents some alternatives).
If you then want only unique rows, you can follow up with unique:
unique(do.call(rbind, dfList))
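To reproduce the asker's desired output in base R for a list of any length, one option is to keep the first element as is and de-duplicate only the remaining elements. A minimal sketch on the question's data; the names `rest` and `combined` are made up for illustration.

```r
dfList <- list(
  DF.1 = data.frame(red = c(1, 2, 3),       blue = c(NA, 1, 2),    green = c(1, 1, 2)),
  DF.2 = data.frame(red = c(2, 3, NA),      blue = c(1, 2, 3),     green = c(1, 2, 4)),
  DF.3 = data.frame(red = c(2, 3, NA, NA),  blue = c(1, 2, NA, 3), green = c(1, 2, 3, 4))
)

# Row-bind everything except the first element, then drop duplicated rows
rest <- unique(do.call(rbind, dfList[-1]))

# Put the untouched first element on top
combined <- rbind(dfList[[1]], rest)

# Row names carry the list-element prefixes; reset them if they are not needed
rownames(combined) <- NULL
combined
```

Unlike calling unique() on the whole bound list, this keeps rows of the first data.frame even when they also appear in the later ones, which matches the seven-row output shown in the question.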

Remove Columns with Grep and Piping

Using the example data:
d1<-data.frame(years=c("1","5","10"),group.x=c(1:3),group.b=c(1:3),group.2x=c(1:3))
I removed the columns using the following:
d2<-d1[,-grep("\\.x$",colnames(d1))]
Is there a similar way to accomplish the same using piping instead?
d2<- d1 %>%
filter(!grep("\\.x$",colnames()))
returns: Error: argument "x" is missing, with no default
The goal is to remove the columns ending with ".x"
In dplyr, filter will remove rows and select is used to remove/select columns which is what you want here. You can stick with grep but dplyr also offers some specialized functions like in this case ends_with:
library(dplyr)
d1 %>%
select(-ends_with(".x"))
# years group.b group.2x
#1 1 1 1
#2 5 2 2
#3 10 3 3
Take a look at help("select") to find out more about other "special functions" you can use inside dplyr::select.
Simply leave d1 in the select as follows:
d1 %>%
select(-grep("\\.x$",colnames(d1)))
Results in :
years group.b group.2x
1 1 1
5 2 2
10 3 3

Reshape data frame from wide to long with re-occuring column names in R

I'm trying to convert a data frame from wide to long format using the melt function. The challenge is that I have multiple columns with the same name. When I use melt, it drops the values from the repeated column. I have read similar questions that advised using the reshape function, but I was not able to get it to work.
To reproduce my starting data frame:
conversion.id<-c("1", "2", "3")
interaction.num<-c("1","1","1")
interaction.num2<-c("2","2","2")
conversion.id<-as.data.frame(conversion.id)
interaction.num<-as.data.frame(interaction.num)
interaction.num2<-as.data.frame(interaction.num2)
conversion<-c(rep("1",3))
conversion<-as.data.frame(conversion)
df<-cbind(conversion.id,interaction.num, interaction.num2, conversion)
names(df)[3]<-"interaction.num"
The data frame looks like the following:
When I run the following melt function:
melt.df<-melt(df,id="conversion.id")
It drops the interaction.num == 2 column and looks something like this:
The data frame I want is the following:
I saw the following post, but I'm not too familiar with the reshape function and wasn't able to get it to work.
How to reshape a dataframe with "reoccurring" columns?
And to add a layer of complexity, I'm looking for a method that is efficient. I need to perform this on a data frame with around 1M rows and many columns labeled the same.
Any advice would be greatly appreciated!
Here is a solution using tidyr instead of reshape2. One of the advantages is the gather_ function, which takes character vectors as inputs. So, first we can replace all the "problematic" variable names with unique names (by adding numbers to the end of each name) and then we can gather (the equivalent of melt) these specific variables. The unique names of the variables are stored in a temporary variable called "prob_var_name", which I removed at the end.
library(tidyr)
library(dplyr)
library(magrittr) # provides equals(), which dplyr does not re-export
var_name <- "interaction.num"
problem_var <- df %>%
names %>%
equals(var_name) %>%
which
replaced_names <- mapply(paste0,names(df)[problem_var],seq_along(problem_var))
names(df)[problem_var] <- replaced_names
df %>%
gather_("prob_var_name",var_name,replaced_names) %>%
select(-prob_var_name)
conversion.id conversion interaction.num
1 1 1 1
2 2 1 1
3 3 1 1
4 1 1 2
5 2 1 2
6 3 1 2
Thanks to the quoting ability of gather_, you could wrap all this into a function and set var_name to a variable. Then maybe you could use it on all of your duplicated variables?
Here's a solution using data.table. You just have to provide the index instead of names.
require(data.table)
require(reshape2)
ans <- melt(setDT(df), measure=2:3,
value.name="interaction.num")[, variable := NULL]
# conversion.id conversion interaction.num
# 1: 1 1 1
# 2: 2 1 1
# 3: 3 1 1
# 4: 1 1 2
# 5: 2 1 2
# 6: 3 1 2
You can get the indices 2:3 by doing grep("interaction.num", names(df)).
Here's an approach in base R that should work for you:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
## Make more friendly names for reshape
names(df)[x] <- paste(names(df)[x], seq_along(x), sep = "_")
## Reshape
reshape(df, direction = "long",
idvar=c("conversion.id", "conversion"),
varying = x, sep = "_")
# conversion.id conversion time interaction.num
# 1.1.1 1 1 1 1
# 2.1.1 2 1 1 1
# 3.1.1 3 1 1 1
# 1.1.2 1 1 2 2
# 2.1.2 2 1 2 2
# 3.1.2 3 1 2 2
Another possibility is stack instead of reshape:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
cbind(df[-x], stack(lapply(df[x], as.character)))
The lapply(df[x], as.character) may not be necessary depending on if your values are actually numeric or not. The way you created this sample data, they were factors.
