I have some code that works but is very clunky and I'm sure there is a better way to do it, avoiding the for loop. Essentially I have a list of performances, and a list of factors. And I want to assign the highest performance to the highest factors, the lowest performance to the lowest factors, etc. Here is some simplified sample code:
#My simplified sample list of performances:
PerformanceList <- data.frame(v1 <- c(rep(10,4)), v2 <- c(rep(9,4)), v3 <- c(rep(8,4)))
View(PerformanceList)
v1 v2 v3
1 10 9 8
2 10 9 8
3 10 9 8
4 10 9 8
#My simplified sample list of Factors:
MyFactors <- data.frame(v1 <- c(35,25,15,5), v2 <- c(10,20,60,20), v3 <- c(5,10,15,40))
View(MyFactors)
v1 v2 v3
1 35 10 5
2 25 20 10
3 15 60 50
4 5 20 40
#Code to find the ranking of each row from largest to smallest:
Rankings <- data.frame(t(apply(-MyFactors, 1, rank, na.last="keep",ties.method="random")))
View(Rankings)
v1 v2 v3
1 1 2 3
2 1 2 3
3 3 1 2
4 3 2 1
Function to sort each row by ranking. I assume there is a better way to do this but I couldn't figure it out:
SortFunction <- function(RankingList){
SortedRankings <- order(RankingList)
return(SortedRankings)
}
#applying that Sort function to each row of the data frame:
SortedRankings <- data.frame(t(apply(Rankings, 1,SortFunction)))
View(SortedRankings)
X1 X2 X3
1 1 2 3
2 1 2 3
3 2 3 1
4 3 2 1
Here is a for loop that does what I want but I'm sure it's not the best way to do it. Basically I want to go down each row of my PerformanceList and choose the column that corresponds to the highest Ranking (which is column 1 from my Sorted Rankings above). I'd ideally like to then be able to assign column 2 from those Sorted Rankings to assign the second highest performance to my second highest factor, and so on...
FactorPerformanceList <- data.frame(matrix(NA, ncol=1, nrow=NROW(Rankings)))
for (i in 1:NROW(Rankings)){
FactorPerformanceList[i,] <- PerformanceList[i,SortedRankings[i,1]]
}
View(FactorPerformanceList)
1 10
2 10
3 9
4 8
It seems like this should work but it gives a matrix of 4 rows by 4 columns instead:
FactorPerformanceList2 <- PerformanceList[,SortedRankings[,1]]
View(FactorPerformanceList2)
v1 v1 v2 v3
1 10 10 9 8
2 10 10 9 8
3 10 10 9 8
4 10 10 9 8
Any ideas or help would be greatly appreciated! Thank you!
This technically does not remove the for-loop, it just hides it. That said, it's a lot cleaner code than what you have, and unless you need all the intermediate data steps, it simplifies things greatly.
PerformanceList <- data.frame(
v1= c(rep(10,4)),
v2= c(rep(9,4)),
v3 = c(rep(8,4))
)
MyFactors <- data.frame(
v1 = c(35,25,15,5),
v2 = c(10,20,60,20),
v3 = c(5,10,15,40))
FactorPerformanceList <- as.data.frame(t(sapply(1:nrow(PerformanceList), function(i) {
PerformanceList[i,order(MyFactors[i,])]
})))
The same code can be written
library(tidyverse)
FactorPerformanceList <- 1:nrow(PerformanceList) %>%
sapply(function(i) {
PerformanceList[i,order(MyFactors[i,])]
}) %>%
t() %>%
as.data.frame()
which makes the order of operations a little clearer (sapply, then t, then as.data.frame).
In general, for-loops can be avoided completely when you're working with columns, but row-wise operations aren't as easy to remove entirely. You can clean up the code by using the apply family of functions, or (if you want something fancier) the plyr or purrr packages.
Given the lack of clarity I've come up with a somewhat flexible answer for you.
It might make sense to take a given data.frame and force it to take a long format, we can make sure we maintain the index positions from the prior structure as this is what you might use to join other data.frames to one another.
I've chosen to use the tidyverse suite of packages to answer this, namely dplyr.
Data
library(tidyverse)
PerformanceList <- data.frame(v1 = c(rep(10,4)), v2 = c(rep(9,4)), v3 = c(rep(8,4)))
MyFactors <- data.frame(v1 = c(35,25,15,5), v2 = c(10,20,60,20), v3 = c(5,10,15,40))
This function will take a data.frame and provide a long format data.frame with index position columns.
Function to convert to long data.frame with index ranks
df_ranks <- function(df) {
names(df) <- 1:ncol(df)
df %>%
mutate(row_index = 1:nrow(.)) %>%
gather(col_index, value, -row_index) %>%
group_by(row_index) %>%
mutate(row_rank = rank(value, na.last = "keep", ties.method = "random")) %>%
group_by(col_index) %>%
mutate(col_rank = rank(value, na.last = "keep", ties.method = "random")) %>%
ungroup()
}
Applying the function to the data, and making sure to adjust column names will let us join without much hassle.
ranked_perf <- df_ranks(PerformanceList) %>% setNames(paste0("rank_", names(.)))
ranked_fact <- df_ranks(MyFactors) %>% setNames(paste0("fact_", names(.)))
We can then join the tables, its important to understand what you want to do and what the expected result may be before this step. For this example I've said that I want to have the matching values within a column by its rank.
full_join(ranked_perf, ranked_fact,
by = c("rank_col_rank" = "fact_col_rank",
"rank_col_index" = "fact_col_index"))
As to what you want to do with this result is up to you, you can select columns and manipulate it back to wide format using combinations of select, unite, and spread.
Related
So I currently face a problem in R that I exactly know how to deal with in Stata, but have wasted over two hours to accomplish in R.
Using the data.frame below, the result I want is to obtain exactly the first observation per group, while groups are formed by multiple variables and have to be sorted by another variable, i.e. the data.frame mydata obtained by:
id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)
Should be transformed to:
id day value
1 1 10
1 2 15
1 3 20
2 1 40
2 2 30
3 2 22
3 3 24
4 1 11
4 2 11
4 3 12
By keeping only one of the rows with one or multiple duplicate group-identificators (here that is only row[1]: (id,day)=(1,1)), sorting for value first (so that the row with the lowest value is kept).
In Stata, this would simply be:
bys id day (value): keep if _n == 1
I found a piece of code on the web, which properly does that if I first produce a single group identifier :
mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="") ### the single group identifier
myid.uni <- unique(mydata$id1)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(mydata, id1==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.
Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?
The package dplyr makes this kind of things easier.
library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
Note that this command requires more memory in R than in Stata: in R, a new copy of the dataset is created while in Stata, rows are deleted in place.
I would order the data.frame at which point you can look into using by:
mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]
do.call(rbind, by(mydata, list(mydata$id, mydata$day),
FUN=function(x) head(x, 1)))
Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:
library(data.table)
DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
# id day value
# 1: 1 1 10
# 2: 1 2 15
# 3: 1 3 20
# 4: 2 1 40
# 5: 2 2 30
# 6: 3 2 22
# 7: 3 3 24
# 8: 4 1 11
# 9: 4 2 11
# 10: 4 3 12
Or, starting from scratch, you can use data.table in the following way:
DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
And, by extension, in base R:
Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
rank(x, ties.method="first")))
mydata[Ranks == 1, ]
Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:
library(data.table)
mydata <- data.table(my.data)
mydata <- mydata[, .SD[1], by = .(id, day)]
Using dplyr with magrittr pipes:
library(dplyr)
mydata <- mydata %>%
group_by(id, day) %>%
slice(1) %>%
ungroup()
If you don't add ungroup() to the end dplyr's grouping structure will still be present and might mess up some of your subsequent functions.
I have a dataframe with 5 columns and a very large dataset. I want to sort by column 3. How do you sort everything after the first row? (When calling this function I want to end it with nrows)
Example output:
Original:
4
7
9
6
8
New:
4
9
8
7
6
Thanks!
If I'm correctly understanding what you want to do, this approach should work:
z <- data.frame(x1 = seq(10), x2 = rep(c(2,3), 5), x3 = seq(14, 23))
zsub <- z[2:nrow(z),]
zsub <- zsub[order(-zsub[,3]),]
znew <- rbind(z[1,], zsub)
Basically, snip off the rows you want to sort, sort them in descending order on column 3, then reattach the first row.
And here's a piped version using dplyr, so you don't clutter the workspace with extra objects:
library(dplyr)
z <- z %>%
slice(2:nrow(z)) %>%
arrange(-x3) %>%
rbind(slice(z, 1), .)
You might try this single line of code to modify the third column in your data frame df as described:
df[,3] <- c(df[1,3],sort(df[-1,3]))
df$x[-1] <- df$x[-1][order(df$x[-1], decreasing=T)]
# x
# 1 4
# 2 9
# 3 8
# 4 7
# 5 6
I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset with likert scales and I want to row sum over different group of columns which share the first strings in their name.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
So, what I did, was to use the grep function in order to be able to sum the rows of the specified variables that started with certain strings and store them in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there is a lot more variables in the dataset and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same strings together and then apply the row function.
Thanks in advance!
One possible solution is to transpose df and calculate sums for the correct columns using base R rowsum function (using set.seed(123))
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(key, value) %>%
extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
group_by(class) %>%
summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
Another potential solution is to use dplyr R rowwise function. https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/
df %>%
rowwise() %>%
mutate(emp_sum = sum(c_across(starts_with("emp"))),
sat_sum = sum(c_across(starts_with("sat"))),
res_sum = sum(c_across(starts_with("res"))),
com_sum = sum(c_across(starts_with("com"))),
cap_sum = sum(c_across(starts_with("cap"))))
I am trying to calculate changes in weight between visits to chicks at different nests. This requires R to look up the nest code in the current row, find the previous time that nest was visited, and subtract the weight at the previous visit from the current visit. For the first visit to each nest, I would like to output the current weight (i.e. as though the weight at the previous, non-existent visit was zero).
My data is of the form:
Nest <- c(a,b,c,d,e,c,b,c)
Weight <- c(2,4,3,3,2,6,8,10)
df <- data.frame(Nest, Weight)
So the desired output here would be:
Change <- c(2,4,3,3,2,3,4,4)
I have achieved the desired output once, by subsetting to a single nest and using a for loop:
tmp <- subset(df, Nest == "a")
tmp$change <- tmp$Weight
for(x in 2:(length(tmp$Nest))){
tmp$change[x] <- tmp$Weight[(x)] - tmp$Weight[(x-1)]
}
but when I try to fit this into ddply
df2 <- ddply(df, "Nest", function(f) {
f$change <- f$Weight
for(x in 2:(length(f$Nest))){
f$change <- f$Weight[(x)] - f$Weight[(x-1)]
}
})
the output gives a blank data.frame (0 obs. of 0 variables).
Am I approaching this the right way but getting the code wrong? Or is there a better way to do it?
Thanks in advance!
Try this:
library(dplyr)
df %>% group_by(Nest) %>% mutate(Change = c(Weight[1], diff(Weight)))
or with just the base of R
transform(df, Change = ave(Weight, Nest, FUN = function(x) c(x[1], diff(x))))
Here is a data.table solution. With large data sets, this is likely to be faster.
library(data.table)
setDT(df)[,Change:=c(Weight[1],diff(Weight)),by=Nest]
df
# Nest Weight Change
# 1: a 2 2
# 2: b 4 4
# 3: c 3 3
# 4: d 3 3
# 5: e 2 2
# 6: c 6 3
# 7: b 8 4
# 8: c 10 4
Can the following code be made more "R like"?
Given data.frame inDF:
V1 V2 V3 V4
1 a ha 1;2;3 A
2 c hb 4 B
3 d hc 5;6 C
4 f hd 7 D
Inside df I want to
find all rows which for the "V3" column has multiple values
separated by ";"
then replicate the respective rows a number of times equal with the number of individual values in the "V3" column,
and then each replicated row receives in the "V3" column only one the initial values
Shortly, the output data.frame (= outDF) will look like:
V1 V2 V3 V4
1 a ha 1 A
1 a ha 2 A
1 a ha 3 A
2 c hb 4 B
3 d hc 5 C
3 d hc 6 C
4 f hd 7 D
So, if from inDF I want to get to outDF, I would write the following code:
#load inDF from csv file
inDF <- read.csv(file='example.csv', header=FALSE, sep=",", fill=TRUE)
#search in inDF, on the V3 column, all the cells with multiple values
rowlist <- grep(";", inDF[,3])
# create empty data.frame and add headers from "headDF"
xDF <- data.frame(matrix(0, nrow=0, ncol=4))
colnames(xDF)=colnames(inDF)
#take every row from the inDF data.frame which has multiple values in col3 and break it in several rows with only one value
for(i in rowlist[])
{
#count the number of individual values in one cell
value_nr <- str_count(inDF[i,3], ";"); value_nr <- value_nr+1
# replicate each row a number of times equal with its value number, and transform it to character
extracted_inDF <- inDF[rep(i, times=value_nr[]),]
extracted_inDF <- data.frame(lapply(extracted_inDF, as.character), stringsAsFactors=FALSE)
# split the values in V3 cell in individual values, place them in a list
value_ls <- str_split(inDF[i, 3], ";")
#initialize f, to use it later to increment both row number and element in the list of values
f = 1
# replace the multiple values with individual values
for(j in extracted_inDF[,3])
{
extracted_inDF[f,3] <- value_ls[[1]][as.integer(f)]
f <- f+1
}
#put all the "demultiplied" rows in xDF
xDF <- merge(extracted_inDF[], xDF[], all=TRUE)
}
# delete the rows with multiple values from the inDF
inDF <- inDF[-rowlist[],]
#create outDF
outDF <- merge(inDF, xDF, all=TRUE)
Could you please
I'm not sure that I'm one to speak about whether you are using R in the "right" or "wrong" way... I mostly just use it to answer questions on Stack Overflow. :-)
However, there are many ways in which your code could be improved. For starters, YES, you should try to become familiar with the predefined functions. They will often be much more efficient, and will make your code much more transparent to other users of the same language. Despite your concise description of what you wanted to achieve, and my knowing an answer virtually right away, I found your code daunting to look through.
I would break up your problem into two main pieces: (1) splitting up the data and (2) recombining it with your original dataset.
For part 1: You obviously know some of the functions you need--or at least the main one you need: strsplit. If you use strsplit, you'll see that it returns a list, but you need a simple vector. How do you get there? Look for unlist. The first part of your problem is now solved.
For part 2: You first need to determine how many times you need to replicate each row of your original dataset. For this, you drill through your list (for example, with l/s/v-apply) and count each item's length. I picked sapply since I knew it would create a vector that I could use with rep.
Then, if you've played with data.frames enough, particularly with extracting data, you would have come to realize that mydf[c(1, 1, 1, 2), ] will result in a data.frame where the first row is repeated two additional times. Knowing this, we can use the length calculation we just made to "expand" our original data.frame.
Finally, with that expanded data.frame, we just need to replace the relevant column with the unlisted values.
Here is the above in action. I've named your dataset "mydf":
V3 <- strsplit(mydf$V3, ";", fixed=TRUE)
sapply(V3, length) ## How many times to repeat each row?
# [1] 3 1 2 1
## ^^ Use that along with `[` to "expand" your data.frame
mydf2 <- mydf[rep(seq_along(V3), sapply(V3, length)), ]
mydf2$V3 <- unlist(V3)
mydf2
# V1 V2 V3 V4
# 1 a ha 1 A
# 1.1 a ha 2 A
# 1.2 a ha 3 A
# 2 c hb 4 B
# 3 d hc 5 C
# 3.1 d hc 6 C
# 4 f hd 7 D
To share some more options...
The "data.table" package can actually be pretty useful for something like this.
library(data.table)
DT <- data.table(mydf)
DT2 <- DT[, list(new = unlist(strsplit(as.character(V3), ";", fixed = TRUE))), by = V1]
merge(DT, DT2, by = "V1")
Alternatively, concat.split.multiple from my "splitstackshape" package pretty much does it in one step, but if you want your exact output, you'll need to drop the NA values and reorder the rows.
library(splitstackshape)
df2 <- concat.split.multiple(mydf, split.cols="V3", seps=";", direction="long")
df2 <- df2[complete.cases(df2), ] ## Optional, perhaps
df2[order(df2$V1), ] ## Optional, perhaps
In this case, you can use the split-apply-combine paradigm for reshaping the data.
You want to split inDF by its rows, since you want to operate on each row separately. I've used the split function here to split it up by row:
spl = split(inDF, 1:nrow(inDF))
spl is a list that contains a 1-row data frame for each row in inDF.
Next, you'll want to apply a function to transform the split up data into the final format you need. Here, I'll use the lapply function to transform the 1-row data frames, using strsplit to break up the variable V3 into its appropriate parts:
transformed = lapply(spl, function(x) {
data.frame(V1=x$V1, V2=x$V2, V3=strsplit(x$V3, ";")[[1]], V4=x$V4)
})
tranformed is now a list where the first element has a 3-row data frame, the third element has a 2-row data frame, and the second and fourth have 1-row data frames.
The last step is to combine this list together into outDF, using do.call with the rbind function. That has the same effect of calling rbind with all of the elements of the transformed list.
outDF = do.call(rbind, transformed)
This yields the desired final data frame:
outDF
# V1 V2 V3 V4
# 1.1 a ha 1 A
# 1.2 a ha 2 A
# 1.3 a ha 3 A
# 2 c hb 4 B
# 3.1 d hc 5 C
# 3.2 d hc 6 C
# 4 f hd 7 D