R: Grouping levels of a factor across multiple files - r

I'm new to R and struggling to group multiple levels of a factor prior to calculating means. This problem is complicated by the fact that I am doing this over hundreds of files that have variable levels of factors that need to be grouped. I see from previous posts how to address this grouping issue for single levels using levels (), but my data is too variable for this method.
Basically, I'd like to calculate both individual and then an overall mean for multiple levels of a factor. For example, I would like to calculate the mean for each species for each of the following factors present in the column Status: Crypt1, Crypt2, Crypt3, Native, Intro, and then also the overall mean for Crypt species (includes Crypt1, Crypt2, and Crypt3, but not Native or Intro). However, a species either has multiple levels of Crypt (variable, and up to Crypt8), or has Native and Intro, and means for all species at each of these levels are ultimately averaged into the same summary sheet.
For example:
Species Status Value
A Crypt1 5
A Crypt1 6
A Crypt2 4
A Crypt2 8
A Crypt3 10
A Crypt3 50
B Native 2
B Native 9
B Intro 9
B Intro 10
I was thinking that I could use the first letter of each factor to group the Crypt factors together, but I am struggling to target the first letter because they are factors, not strings, and I am not sure how to convert between them. I'm ultimately calculating the means using aggregate(), and I can get individual means for each factor, but not for the grouped factors.
Any ideas would be much appreciated, thanks!

For the individual means:
# assuming your data is in data.frame = df
require(plyr)
df.1 <- ddply(df, .(Species, Status), summarise, ind.m.Value = mean(Value))
> df.1
# Species Status ind.m.Value
# 1 A Crypt1 5.5
# 2 A Crypt2 6.0
# 3 A Crypt3 30.0
# 4 B Intro 9.5
# 5 B Native 5.5
For the overall mean, the idea is to remove the numbers present at the end of every entry in Status using sub/gsub.
df.1$Status2 <- gsub("[0-9]+$", "", df.1$Status)
df.2 <- ddply(df.1, .(Species, Status2), summarise, oall.m.Value = mean(ind.m.Value))
> df.2
# Species Status2 oall.m.Value
# 1 A Crypt 13.83333
# 2 B Intro 9.50000
# 3 B Native 5.50000
Is this what you're expecting?

Here's an alternative. Conceptually, it is the same as Arun's answer, but it sticks to functions in base R, and in a way, keeps your workspace and original data somewhat tidy.
I'm assuming we're starting with a data.frame named "temp" and that we want to create two new data.frames, "T1" and "T2" for individual and grouped means.
# Verify that you don't have T1 and T2 in your workspace
ls(pattern = "T[1|2]")
# character(0)
# Use `with` to generate T1 (individual means)
# and to generate T2 (group means)
with(temp, {
T1 <<- aggregate(Value ~ Species + Status, temp, mean)
temp$Status <- gsub("\\d+$", "", Status)
T2 <<- aggregate(Value ~ Species + Status, temp, mean)
})
# Now they're there!
ls(pattern = "T[1|2]")
# [1] "T1" "T2"
Notice that we used <<- to assign the results from within with to the global environment. Not everyone likes using that, but I think it is OK in this particular case. Here is what "T1" and "T2" look like.
T1
# Species Status Value
# 1 A Crypt1 5.5
# 2 A Crypt2 6.0
# 3 A Crypt3 30.0
# 4 B Intro 9.5
# 5 B Native 5.5
T2
# Species Status Value
# 1 A Crypt 13.83333
# 2 B Intro 9.50000
# 3 B Native 5.50000
Looking back at the with command, it might have seemed like we had changed the value of the "Status" column. However, that was only within the environment created by using with. Your original data.frame is the same as it was when you started.
temp
# Species Status Value
# 1 A Crypt1 5
# 2 A Crypt1 6
# 3 A Crypt2 4
# 4 A Crypt2 8
# 5 A Crypt3 10
# 6 A Crypt3 50
# 7 B Native 2
# 8 B Native 9
# 9 B Intro 9
# 10 B Intro 10

Related

Newbie working on Horse Racing Database using R

I'm new to the group and to R language.
I've written some code (below) that achieves the desired result.
However, i'm aware that i'm reproducing lines of the same code which would surely be more efficiently coded using a for loop.
Also, there will be races with large numbers of horses so I really need to be able to run a for loop that runs through each horse.
ie. num_runners = NROW(my_new_data)
my_new_data holds data on horses previous races.
DaH is a numeric rating that is attached to each of a horse's previous runs with DaH1 being the most recent and DaH6 is six races back.
Code, a character, signifies the type of race that the horse competed in. ie. Flat, Fences.
I have played with using for loops, ie. for(i in 1:6) without success.
Since I am assigning to a new horse each time I would hope something such as the following would work:
horse(i) = c(my_new_data$DaH1[i],my_new_data$DaH2[i],my_new_data$DaH3[i],my_new_data$DaH4[i],my_new_data$DaH5[i],my_new_data$DaH6[i])
But I know that horse(i) is not allowed.
Would my best strategy be to pre-define a dataframe of size: 6 rows and 6 columns
and use 2 for loops to populate [row][column]? Something like:
final_data[i,j]
Here is the code I am presently using which creates the dataframe racetest:
horse1 = c(my_new_data$DaH1[1],my_new_data$DaH2[1],my_new_data$DaH3[1],my_new_data$DaH4[1],my_new_data$DaH5[1],my_new_data$DaH6[1])
horse2 = c(my_new_data$DaH1[2],my_new_data$DaH2[2],my_new_data$DaH3[2],my_new_data$DaH4[2],my_new_data$DaH5[2],my_new_data$DaH6[2])
horse3 = c(my_new_data$DaH1[3],my_new_data$DaH2[3],my_new_data$DaH3[3],my_new_data$DaH4[3],my_new_data$DaH5[3],my_new_data$DaH6[3])
horse4 = c(my_new_data$DaH1[4],my_new_data$DaH2[4],my_new_data$DaH3[4],my_new_data$DaH4[4],my_new_data$DaH5[4],my_new_data$DaH6[4])
horse5 = c(my_new_data$DaH1[5],my_new_data$DaH2[5],my_new_data$DaH3[5],my_new_data$DaH4[5],my_new_data$DaH5[5],my_new_data$DaH6[5])
horse6 = c(my_new_data$DaH1[6],my_new_data$DaH2[6],my_new_data$DaH3[6],my_new_data$DaH4[6],my_new_data$DaH5[6],my_new_data$DaH6[6])
horse1.code = c(my_new_data$Code1[1],my_new_data$Code2[1],my_new_data$Code3[1],my_new_data$Code4[1],my_new_data$Code5[1],my_new_data$Code6[1])
horse2.code = c(my_new_data$Code1[2],my_new_data$Code2[2],my_new_data$Code3[2],my_new_data$Code4[2],my_new_data$Code5[2],my_new_data$Code6[2])
horse3.code = c(my_new_data$Code1[3],my_new_data$Code2[3],my_new_data$Code3[3],my_new_data$Code4[3],my_new_data$Code5[3],my_new_data$Code6[3])
horse4.code = c(my_new_data$Code1[4],my_new_data$Code2[4],my_new_data$Code3[4],my_new_data$Code4[4],my_new_data$Code5[4],my_new_data$Code6[4])
horse5.code = c(my_new_data$Code1[5],my_new_data$Code2[5],my_new_data$Code3[5],my_new_data$Code4[5],my_new_data$Code5[5],my_new_data$Code6[5])
horse6.code = c(my_new_data$Code1[6],my_new_data$Code2[6],my_new_data$Code3[6],my_new_data$Code4[6],my_new_data$Code5[6],my_new_data$Code6[6])
racetest = data.frame(horse1,horse1.code,horse2,horse2.code, horse3, horse3.code,
horse4,horse4.code,horse5,horse5.code, horse6, horse6.code)
Thanks in advance for any help that can be offered!
Graham
using loops in R is usually not the correct approach. Still I will give you something which might work.
There are two possible approaches I see here, I will address the simpler one:
if columns are ordered such that column 1:6 are named DaH1 to DaH6 and columns 7: 12 are the ones named horse1.code etc... in this case:
library(magrittr)
temp<- cbind(my_new_data[,1:6] %>% t,
my_new_data[,7:12]%>% t)
Odd = seq(1,12,2)
my_new_data[ , Odd] = temp[,1:6]
my new_data[ , -Odd] = temp[,7:12]
#cleanup
rm(temp,Odd)
my_new_data should now contain your desired output. Before you run this, make sure your data is backed up inside another object as this is untested code.
Actually we want to reshape the wide format of the data in a different wide format. But first let's look at your desired for loop approach, to understand what's going on.
Using a loop
For the loop we'll need two variables with sequences i and j.
## initialize matrix with dimnames
racetest <- matrix(NA, 3, 6,
dimnames=list(c("DaH1", "DaH2", "DaH3"),
c("horse1", "horse1.code", "horse2", "horse2.code",
"horse3", "horse3.code")))
## loop
for (i in 0:2) {
for (j in 1:3) {
racetest[j, 1:2+2*i] <- unlist(my_new_data[i+1, c(1, 4)])
}
}
# horse1 horse1.code horse2 horse2.code horse3 horse3.code
# DaH1 1 1 2 2 3 3
# DaH2 1 1 2 2 3 3
# DaH3 1 1 2 2 3 3
Often for loops are discouraged in R, because they might be slow and doesn't use the vectorized features of the R language. Moreover they can also be tricky to program.
Transposing column sets
We also could do a different approach. Actually we want to transpose the DaH* and Code* column sets (identifiable using grep) and bring them in the appropriate order using substring of names, with nchar as first character.
rownames(my_new_data) <- paste0("horse.", seq(nrow(my_new_data)))
rr <- data.frame(DaH=t(my_new_data[, grep("DaH", names(my_new_data))]),
Code=t(my_new_data[, grep("Code", names(my_new_data))]))
rr <- rr[order(substring(names(rr), nchar(names(rr))))]
rr
# DaH.horse.1 Code.horse.1 DaH.horse.2 Code.horse.2 DaH.horse.3 Code.horse.3
# DaH1 1 1 2 2 3 3
# DaH2 1 1 2 2 3 3
# DaH3 1 1 2 2 3 3
Reshaping data
Last but not least, we actually want to reshape the data. For this we give the data set an ID variable.
my_new_data <- transform(my_new_data, horse=1:nrow(my_new_data))
At first, we reshape the data into "long" format, using the new ID variable horse and put the two varying column sets into a list.
rr1 <- reshape(my_new_data, idvar="horse", varying=list(1:3, 4:6), direction="long", sep="",
v.names=c("DaH", "Code"))
rr1
# horse time DaH Code
# 1.1 1 1 1 1
# 2.1 2 1 2 2
# 3.1 3 1 3 3
# 1.2 1 2 1 1
# 2.2 2 2 2 2
# 3.2 3 2 3 3
# 1.3 1 3 1 1
# 2.3 2 3 2 2
# 3.3 3 3 3 3
Then, in order to get the desired wide format, what we want is to swap idvar and timevar, where our new idvar is "time" and our new timevar is "horse".
reshape(rr1, timevar="horse", idvar="time", direction= "wide")
# time DaH.1 Code.1 DaH.2 Code.2 DaH.3 Code.3
# 1.1 1 1 1 2 2 3 3
# 1.2 2 1 1 2 2 3 3
# 1.3 3 1 1 2 2 3 3
Benchmark
The benchmark reveals that of these three approaches transposing of the matrices is fastest, while the 'for' loop is actually by far the slowest.
# Unit: microseconds
# expr min lq mean median uq max neval cld
# forloop 7191.038 7373.5890 8381.8036 7576.678 7980.4320 46677.324 100 c
# transpose 620.748 656.0845 707.7248 692.953 733.1365 944.773 100 a
# reshape 2791.710 2858.6830 3013.8372 2958.825 3118.4125 3871.960 100 b
Toy data:
my_new_data <- data.frame(DaH1=1:3, DaH2=1:3, DaH3=1:3, Code1=1:3, Code2=1:3, Code3=1:3)

How to transpose a long data frame every n rows

I have a data frame like this:
x=data.frame(type = c('a','b','c','a','b','a','b','c'),
value=c(5,2,3,2,10,6,7,8))
every item has attributes a, b, c while some records may be missing records, i.e. only have a and b
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g assuming each occurrence of a starts a new group, use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector which is slightly more complex but would be needed if some groups have a missing is the following. It assumes that x lists the type values in order of their levels within group so that if a level is less than the prior level it must be the start of a new group.
g <- cumsum(c(-1, diff(as.numeric(x$type))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise, the problem is ambiguous. For example if one group can have b and c missing and then next group can have a missing then whether b and c in the second group actually form a second group or are part of the first group is not determinable.

R: change one value every row in big dataframe

I just started working with R for my master thesis and up to now all my calculations worked out as I read a lot of questions and answers here (and it's a lot of trial and error, but thats ok).
Now i need to process a more sophisticated code and i can't find a way to do this.
Thats the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50000 entries) I want to change only one value every row. The new value should be the amount of the existing entry plus a few values from another sub-data-set (140000 entries) where the 'ID'-variable is the same.
As this is the third day I'm trying to solve this, I already found and tested for and apply but both are running for hours (canceled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
Entry_ID <- Sub02[i,4]
SUM_Entries <- sum(Sub03$Source==Entry_ID)
Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
Value1 <- as.numeric(Entries_w_ID$VAL1)
SUM_Value1 <- sum(Value1)
Value2 <- as.numeric(Entries_w_ID$VAL2)
SUM_Value2 <- sum(Value2)
OLD_Val1 <- Sub02[i,13]
OLD_Val <- as.numeric(OLD_Val1)
NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
Sub02[i,13] <- NEW_Val
}
I know this might be a silly code, but thats the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get along with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what #yoland already contributed, I would suggest to break it down in two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
Values is character now, we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
(Updated.)
From what I understand you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to safe memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here) Once you joined the data sets into one, it should be easy to create the new columns you need. e.g.: df$new <- df$old1 + df$old2

Use value in new variable name

I am trying to build a for loop which will step through each site, for that site calculate frequencies of a response, and put those results in a new data frame. Then after the loop I want to be able to combine all of the site data frames so it will look something like:
Site Genus Freq
1 A 50
1 B 30
1 C 20
2 A 70
2 B 10
2 C 20
But to do this I need my names (of vectors, dataframes) to change each time through the loop. I think I can do this using the SiteNum variable, but how do I insert it into new variable names? The way I tried (below) treats it like part of the string, doesn't insert the value for the name.
I feel like what I want to use is a placeholder %, but I don't know how to do that with variable names.
> SiteNum <- 1
> for (Site in CoralSites){
> Csub_SiteNum <- subset(dfrmC, Site==CoralSites[SiteNum])
> CGrfreq_SiteNum <- numeric(length(CoralGenera))
> for (Genus in CoralGenera){
> CGrfreq_SiteNum[GenusNum] <- mean(dfrmC$Genus == CoralGenera[GenusNum])*100
> GenusNum <- GenusNum + 1
> }
> names(CGrfreq_SiteNum) <- c(CoralGenera)
> Site_SiteNum <- c(Site)
> CG_SiteNum <- data.frame(CoralGenera,CGrfreq_SiteNum,Site_SiteNum)
> SiteNum <- SiteNum + 1
> }
Your question as stated asks how you can create a bunch of variables, e.g. CGrfreq_1, CGrfreq_2, ..., where the name of the variable indicates the site number that it corresponds to (1, 2, ...). While you can do such a thing with functions like assign, it is not good practice for a few reasons:
It makes your code to generate the variables more complicated because it will be littered with calls to assign and get and paste0.
It makes your data more difficult to manipulate afterwards -- you'll need to (either manually or programmatically) identify all the variables of a certain type, grab their values with get or mget, and then do something with them.
Instead, you'll find it easier to work with other R functions that will perform the aggregation for you. In this case you're looking to generate for each Site/Genus pairing the percentage of data points at the site with the particular genus value. This can be done in a few lines of code with the aggregate function:
# Sample data:
(dat <- data.frame(Site=c(rep(1, 5), rep(2, 5)), Genus=c(rep("A", 3), rep("B", 6), "A")))
# Site Genus
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 1 B
# 6 2 B
# 7 2 B
# 8 2 B
# 9 2 B
# 10 2 A
# Calculate frequencies
dat$Freq <- 1
res <- aggregate(Freq~Genus+Site, data=dat, sum)
res$Freq <- 100 * res$Freq / table(dat$Site)[as.character(res$Site)]
res
# Genus Site Freq
# 1 A 1 60
# 2 B 1 40
# 3 A 2 20
# 4 B 2 80

Reordering data and making a re-ordered list in r

I previously asked this question:"
How do I re-order a dataframe with multiple species in r. Each species has a different number of observations and I need the final dataframe to be ordered in descending order with the species with most observations listed first. In this example the final dataframe should look list specie B first then Species C and finally specie A."
colA= c("C","C","C","B","B","B","B","A","A")
colB= c(1.1,2.1,3.1,4.1,5.1,6.1,7.1,8.1,9.1)
colC= c(-1.2,-2.1,-3.1,-4.1,-5.1,-6.1,-7.1,-8.1,-9.1)
df= data.frame (spp=colA, latitude=colB, longitude=colC)
df
I received a great answer which worked well:
# add a column counting the number of rows in each species
df <- transform(df, n = ave(latitude ,spp, FUN = length))
# order by this new column
dfordered <- df[order(df$n),decreasing = TRUE]
but now I am stuck again in making an object"species" which has the ordered species names in it.
Right now I have:
species <- levels(df$spp)
this command puts everything back in alphabetical order but I need the object to be ordered by "n" (the number of records). Any suggestions. Thanks in advance!
Cheers,
Israel
This is as easy as using unique. I coerce to character so as not to get confused between levels and actual values.
unique(as.character(dfordered$spp))
To get here from the original data, use table
table(df$spp)
## A B C
## 2 4 3
You can sort this
sppCount <- table(df$spp)
names(sort(sppCount, decreasing = TRUE))
# [1] "B" "C" "A"
If you are primarily interested in relative abundance, you may also want your spp factor to be ordered by frequency. In that case, use reorder() to arrange its levels as desired, and then sort the data.frame so that the species with the most observations come first.
df <- transform(df, spp = reorder(spp, spp, length))
df[order(df$spp, decreasing=TRUE),]
# spp latitude longitude
# 4 B 4.1 -4.1
# 5 B 5.1 -5.1
# 6 B 6.1 -6.1
# 7 B 7.1 -7.1
# 1 C 1.1 -1.2
# 2 C 2.1 -2.1
# 3 C 3.1 -3.1
# 8 A 8.1 -8.1
# 9 A 9.1 -9.1
## To see one advantage of reordering the factor levels
barplot(table(df$spp))

Resources