I have been searching for this for a while but haven't found a clear answer so far. I have probably been searching for the wrong terms, but maybe somebody here can quickly help me. The question is fairly basic.
Sample data set:
set <- structure(list(VarName = structure(c(1L, 5L, 4L, 2L, 3L),
.Label = c("Apple/Blue/Nice",
"Apple/Blue/Ugly", "Apple/Pink/Ugly", "Kiwi/Blue/Ugly", "Pear/Blue/Ugly"
), class = "factor"), Color = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("Blue",
"Pink"), class = "factor"), Qty = c(45L, 34L, 46L, 21L, 38L)), .Names = c("VarName",
"Color", "Qty"), class = "data.frame", row.names = c(NA, -5L))
This gives a data set like:
set
VarName Color Qty
1 Apple/Blue/Nice Blue 45
2 Pear/Blue/Ugly Blue 34
3 Kiwi/Blue/Ugly Blue 46
4 Apple/Blue/Ugly Blue 21
5 Apple/Pink/Ugly Pink 38
What I would like to do is fairly straightforward: I would like to sum (or average, or take the standard deviation of) the Qty column. I would also like to do the same operation under each of the following conditions:
VarName includes "Apple"
VarName includes "Ugly"
Color equals "Blue"
Can anybody give me a quick introduction to performing this kind of calculation?
I am aware that some of it can be done by the aggregate() function, e.g.:
aggregate(set[3], FUN=sum, by=set[2])[1,2]
However, I believe there is a more straightforward way of doing this than this. Are there filters that can be added to functions like sum()?
The easiest way is to split up your VarName column; then subsetting becomes very easy. So, let's create an object where VarName has been separated:
##There must(?) be a better way than this. Anyone?
new_set = t(as.data.frame(sapply(as.character(set$VarName), strsplit, "/")))
Brief explanation:
We use as.character because set$VarName is a factor
sapply takes each value in turn and applies strsplit
The strsplit function splits up the elements
We convert to a data frame
Transpose to get the correct rotation
Next,
##Convert to a data frame
new_set = as.data.frame(new_set)
##Make nice rownames - not actually needed
rownames(new_set) = 1:nrow(new_set)
##Add in the Qty column
new_set$Qty = set$Qty
This gives
R> new_set
V1 V2 V3 Qty
1 Apple Blue Nice 45
2 Pear Blue Ugly 34
3 Kiwi Blue Ugly 46
4 Apple Blue Ugly 21
5 Apple Pink Ugly 38
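On the "better way" asked about in the comment above: one possible alternative is to row-bind the pieces returned by strsplit directly, which avoids the sapply/transpose round trip. This is only a sketch; the data frame below is retyped from the sample data, and the name new_set2 is made up for illustration.

```r
# Sample data retyped from the question (VarName and Color as factors)
set <- data.frame(
  VarName = c("Apple/Blue/Nice", "Pear/Blue/Ugly", "Kiwi/Blue/Ugly",
              "Apple/Blue/Ugly", "Apple/Pink/Ugly"),
  Color = c("Blue", "Blue", "Blue", "Blue", "Pink"),
  Qty = c(45L, 34L, 46L, 21L, 38L),
  stringsAsFactors = TRUE
)

# strsplit returns one character vector per row; rbind stacks them into a matrix
parts <- do.call(rbind, strsplit(as.character(set$VarName), "/", fixed = TRUE))

# Build the data frame in one step and name the split columns V1..V3
new_set2 <- data.frame(parts, Qty = set$Qty, stringsAsFactors = FALSE)
names(new_set2)[1:3] <- c("V1", "V2", "V3")
print(new_set2)
```

This assumes every VarName has exactly three parts; rows with a different number of pieces would need separate handling.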
Now all the operations are as standard. For example,
##Add up all blue Qtys
sum(new_set[new_set$V2 == "Blue",]$Qty)
[1] 146
##Average of Blue and Ugly Qtys
mean(new_set[new_set$V2 == "Blue" & new_set$V3 == "Ugly",]$Qty)
[1] 33.67
Once it's in the correct form, you can use ddply, which does everything you want (and more)
library(plyr)
##Split the data frame up by V1 and take the mean of Qty
ddply(new_set, .(V1), summarise, m = mean(Qty))
##Split the data frame up by V1 & V2 and take the mean of Qty
ddply(new_set, .(V1, V2), summarise, m = mean(Qty))
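If you would rather not load plyr, the same grouped means can probably be obtained with the formula interface of base R's aggregate(). A minimal sketch, with new_set retyped by hand from the output shown earlier:

```r
# new_set retyped from the printed output above
new_set <- data.frame(
  V1 = c("Apple", "Pear", "Kiwi", "Apple", "Apple"),
  V2 = c("Blue", "Blue", "Blue", "Blue", "Pink"),
  V3 = c("Nice", "Ugly", "Ugly", "Ugly", "Ugly"),
  Qty = c(45L, 34L, 46L, 21L, 38L),
  stringsAsFactors = FALSE
)

# Mean of Qty for each value of V1
by_fruit <- aggregate(Qty ~ V1, data = new_set, FUN = mean)
print(by_fruit)

# Mean of Qty for each observed combination of V1 and V2
by_both <- aggregate(Qty ~ V1 + V2, data = new_set, FUN = mean)
print(by_both)
```

Swapping FUN for sum or sd gives the other summaries mentioned in the question.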
Is this what you're looking for?
# sum for those including 'Apple'
apple <- set[grep('Apple', set[, 'VarName']), ]
aggregate(apple[3], FUN=sum, by=apple[2])
Color Qty
1 Blue 66
2 Pink 38
# sum for those including 'Ugly'
ugly <- set[grep('Ugly', set[, 'VarName']), ]
aggregate(ugly[3], FUN=sum, by=ugly[2])
Color Qty
1 Blue 101
2 Pink 38
# sum for Color==Blue
sum(set[set[, 'Color']=='Blue', 3])
[1] 146
The last sum could be done by using subset
sum(subset(set, Color=='Blue')[,3])
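As for "filters" on sum(): a logical vector used as an index plays that role, and it works the same way for mean() or sd(). A small sketch on data retyped from the question; the helper names has_apple, has_ugly and is_blue are invented for this example.

```r
# Sample data retyped from the question
set <- data.frame(
  VarName = c("Apple/Blue/Nice", "Pear/Blue/Ugly", "Kiwi/Blue/Ugly",
              "Apple/Blue/Ugly", "Apple/Pink/Ugly"),
  Color = c("Blue", "Blue", "Blue", "Blue", "Pink"),
  Qty = c(45L, 34L, 46L, 21L, 38L),
  stringsAsFactors = TRUE
)

# One logical vector per condition; grepl() does the "includes" test
has_apple <- grepl("Apple", as.character(set$VarName), fixed = TRUE)
has_ugly  <- grepl("Ugly", as.character(set$VarName), fixed = TRUE)
is_blue   <- set$Color == "Blue"

sum(set$Qty)               # all rows: 184
sum(set$Qty[has_apple])    # VarName includes "Apple": 104
sum(set$Qty[has_ugly])     # VarName includes "Ugly": 139
sum(set$Qty[is_blue])      # Color equals "Blue": 146

# Conditions combine with & and |, and any summary function can be used
mean(set$Qty[has_apple & is_blue])
sd(set$Qty[has_ugly])
```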
This is an example of data:
exp_data <- structure(list(Seq = c("AAAARVDS", "AAAARVDSSSAL",
"AAAARVDSRASDQ"), Change = structure(c(19L, 20L, 13L), .Label = c("",
"C[+58]", "C[+58], F[+1152]", "C[+58], F[+1152], L[+12], M[+12]",
"C[+58], L[+2909]", "L[+12]", "L[+370]", "L[+504]", "M[+12]",
"M[+1283]", "M[+1457]", "M[+1491]", "M[+16]", "M[+16], Y[+1013]",
"M[+16], Y[+1152]", "M[+16], Y[+762]", "M[+371]", "M[+386], Y[+12]",
"M[+486], W[+12]", "Y[+12]", "Y[+1240]", "Y[+1502]", "Y[+1988]",
"Y[+2918]"), class = "factor"), `Mass` = c(1869.943,
1048.459, 707.346), Size = structure(c(2L, 2L, 2L), .Label = c("Matt",
"Greg",
"Kieran"
), class = "factor"), `Number` = c(2L, 2L, 2L)), row.names = c(244L,
392L, 396L), class = "data.frame")
I would like to draw your attention to the column named Change, as this is the one I would like to use for filtering. We have three rows here, and I would like to keep only the first one because it contains a change greater than 100 for a specific letter. In general, I would like to keep every row that contains a change greater than +100 for some letter. There may be up to 4-5 letters in the Change column, but if at least one of them has a modification of at least +100, I would like to keep the row.
Do you have any simple solution for that ?
Expected output:
Seq Change Mass Size Number
244 AAAARVDS M[+486], W[+12] 1869.943 Greg 2
Not entirely sure I understood your problem statement correctly, but perhaps something like this
library(dplyr)
library(stringr)
exp_data %>% filter(str_detect(Change, "\\d{3}"))
# Seq Change Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg 2
Or the same in base R
exp_data[grep("\\d{3}", exp_data$Change), ]
# Seq Change Mass Size Number
#1 AAAARVDS M[+486], W[+12] 1869.943 Greg 2
The idea is to use a regular expression to keep only those rows where Change contains at least one three-digit expression.
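Note that the three-digit pattern also matches a value of exactly 100 and does not compare numbers as such. If a strict numeric threshold matters, one option is to extract the numbers and compare them; the sketch below uses only base R, and the two-column data frame is a cut-down copy of the example (the other columns are omitted for brevity).

```r
# Cut-down copy of the example data (only the columns needed here)
exp_data <- data.frame(
  Seq = c("AAAARVDS", "AAAARVDSSSAL", "AAAARVDSRASDQ"),
  Change = c("M[+486], W[+12]", "Y[+12]", "M[+16]"),
  stringsAsFactors = FALSE
)

# Pull every run of digits out of each Change string
change_chr <- as.character(exp_data$Change)
nums <- regmatches(change_chr, gregexpr("[0-9]+", change_chr))

# Keep a row if any of its numbers exceeds the threshold;
# rows with no numbers at all give any(logical(0)), which is FALSE
keep <- vapply(nums, function(x) any(as.numeric(x) > 100), logical(1))
print(exp_data[keep, ])
```

Changing `> 100` to `>= 100` would match the "at least +100" reading of the question.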
You can use str_extract_all from the stringr package
library(stringr)
data.table solution
library(data.table)
setDT(exp_data)
exp_data[, max := max(as.numeric(str_extract_all(Change, "[[:digit:]]+")[[1]])), by = Seq]
exp_data[max > 100, ]
Seq Change Mass Size Number max
1: AAAARVDS M[+486], W[+12] 1869.9 Greg 2 486
dplyr solution
library(dplyr)
exp_data %>%
group_by(Seq) %>%
filter(max(as.numeric(str_extract_all(Change, "[[:digit:]]+")[[1]])) > 100)
# A tibble: 1 x 5
# Groups: Seq [1]
Seq Change Mass Size Number
<chr> <fct> <dbl> <fct> <int>
1 AAAARVDS M[+486], W[+12] 1870. Greg 2
I have a dataset with two columns containing the following: an indicator number and a hashcode
The only problem is that the columns have the same name, but the value can switch columns.
Now I want to merge the columns and keep the number (I don't care about the hashcode)
I saw this question: Merge two columns into one in r
and I tried the coalesce() function, but that only helps when there are NA values, which I don't have. I looked at the unite function, but according to the cheat sheet documentation here it doesn't do what I'm looking for.
My next try was the filter_at and other filter functions from the dplyr package Documentation here
But that only leaves 150 data points while at the start I have 61k data points.
Code of filter_at I tried:
data <- filter_at(data,vars("hk","hk_1"),all_vars(.>0))
I assumed that a #-string would not be greater than 0, which seems to be true, but it removes more than intended.
I would like to keep hk or hk_1 value which is a number. The other one (the hash) can be removed. Then I want a new column which only contains those numbers.
Sample data
My data looks like this:
HK|HK1
190|#SP0839
190|#SP0340
178|#SP2949
#SP8390|177
#SP2240|212
What I would like to see:
HK
190
190
178
177
212
I hope this provides some insight into the data. There are more columns, such as a description, which means the 190 rows at the start are not duplicates.
We can replace all the values that start with "#" to NA and then use coalesce to select non-NA value between HK and HK1.
library(dplyr)
df %>%
mutate_all(~as.character(replace(., grepl("^#", .), NA))) %>%
mutate(HK = coalesce(HK, HK1)) %>%
select(HK)
# HK
#1 190
#2 190
#3 178
#4 177
#5 212
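A base R variant of the same idea, as a rough sketch: pick HK1 wherever HK starts with "#", otherwise keep HK. The data frame is retyped from the question with character columns, and HK_num is a made-up name for the new column.

```r
# Sample data retyped from the question, as plain character columns
df <- data.frame(
  HK  = c("190", "190", "178", "#SP8390", "#SP2240"),
  HK1 = c("#SP0839", "#SP0340", "#SP2949", "177", "212"),
  stringsAsFactors = FALSE
)

# Where HK holds a hash code, take the value from HK1 instead
hk <- ifelse(grepl("^#", df$HK), df$HK1, df$HK)

# Store the result as a numeric column
df$HK_num <- as.integer(hk)
print(df$HK_num)
```

This assumes exactly one of the two columns holds the number in every row; a row with hashes in both columns would become NA (with a warning) in the conversion.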
data
df <- structure(list(HK = structure(c(4L, 4L, 3L, 2L, 1L), .Label = c("#SP2240",
"#SP8390", "178", "190"), class = "factor"), HK1 = structure(c(2L,
1L, 3L, 4L, 5L), .Label = c("#SP0340", "#SP0839", "#SP2949",
"177", "212"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))
I need to manipulate the raw data (csv) to a wide format so that I can analyze in R or SPSS.
It looks something like this:
1,age,30
1,race,black
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99
Ideally it would look like:
ID,age,race,scale_total, etc
1, 30, black, 35
2, 20, white, 99
I added values to the top row of the raw data (ID, Question, Response) and tried the cast function but I believe this aggregated data instead of just transforming it:
data_mod <- cast(raw.data2, ID~Question, value="Response")
Aggregation requires fun.aggregate: length used as default
You could use tidyr...
library(tidyr)
df<-read.csv(text="1,age,30
1,race,black
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99", header=FALSE, stringsAsFactors=FALSE)
df %>% spread(key=V2,value=V3)
V1 age race scale_total
1 1 30 black 35
2 2 20 white 99
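In current tidyr, spread() is superseded by pivot_wider(). If you want to avoid the extra package altogether, base R's reshape() can do the same long-to-wide step; the following is only a sketch on the same six rows.

```r
# Same six rows as above, read without a header (columns V1, V2, V3)
df <- read.csv(text = "1,age,30
1,race,black
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99", header = FALSE, stringsAsFactors = FALSE)

# One row per V1, one column per distinct value of V2
wide <- reshape(df, idvar = "V1", timevar = "V2", direction = "wide")

# reshape() prefixes the new columns with "V3."; strip that prefix
names(wide) <- sub("^V3\\.", "", names(wide))

# The values arrive as character; convert the numeric ones explicitly
wide$age <- as.numeric(wide$age)
wide$scale_total <- as.numeric(wide$scale_total)
print(wide)
```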
We need a sequence column to be created to take care of the duplicate rows which by default results in aggregation to length
library(data.table)
dcast(setDT(df1), ID + rowid(Question) ~ Question, value.var = 'Response')
NOTE: The example data has no duplicate ID/Question rows, so it gives the expected output without the sequence column:
dcast(setDT(df1), ID ~ Question)
# ID age race scale_total
#1: 1 30 black 35
#2: 2 20 white 99
So the sequence column only matters when this is applied to the full dataset, which has duplicate rows.
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), Question = c("age",
"race", "scale_total", "age", "race", "scale_total"), Response = c("30",
"black ", "35", "20", "white", "99")), class = "data.frame",
row.names = c(NA, -6L))
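To see what the sequence column does, here is a small base R sketch on a hypothetical data set in which the same ID answers each question twice. The duplicated rows and the column name seq are invented for illustration; ave() stands in for data.table's rowid().

```r
# Hypothetical data: ID 1 has two responses for each question
dup <- data.frame(
  ID = c(1L, 1L, 1L, 1L),
  Question = c("age", "age", "race", "race"),
  Response = c("30", "31", "black", "white"),
  stringsAsFactors = FALSE
)

# Number the repeats within each ID/Question pair: 1, 2, 1, 2
dup$seq <- ave(seq_along(dup$ID), dup$ID, dup$Question, FUN = seq_along)

# With (ID, seq) as the row key, every cell has exactly one value,
# so nothing needs to be aggregated
wide <- reshape(dup, idvar = c("ID", "seq"), timevar = "Question",
                direction = "wide")
print(wide)
```

Each repeat ends up on its own row instead of being collapsed into a count.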
For SPSS:
data list list/ID (f5) Question Response (2a20).
begin data
1 "age" "30"
1 "race" "black"
1 "scale_total" "35"
2 "age" "20"
2 "race" "white"
2 "scale_total" "99"
end data.
casestovars /id=id /index=question.
Note that the resulting variables age and scale_total will be string variables - you'll have to turn them into numbers before further transformations:
alter type age scale_total (f8).
I am trying to figure out how to get the time between consecutive events when events are stored as a column of dates in a dataframe.
sampledf=structure(list(cust = c(1L, 1L, 1L, 1L), date = structure(c(9862,
9879, 10075, 10207), class = "Date")), .Names = c("cust", "date"
), row.names = c(NA, -4L), class = "data.frame")
I can get an answer with
as.numeric(rev(rev(difftime(c(sampledf$date[-1],0),sampledf$date))[-1]))
# [1] 17 196 132
but it is really ugly. Among other things, I only know how to exclude the first item in a vector, but not the last so I have to rev() twice to drop the last value.
Is there a better way?
By the way, I will use ddply to do this to a larger set of data for each cust id, so the solution would need to work with ddply.
library(plyr)
ddply(sampledf,
c("cust"),
summarize,
daysBetween = as.numeric(rev(rev(difftime(c(date[-1],0),date))[-1]))
)
Thank you!
Are you looking for this?
as.numeric(diff(sampledf$date))
# [1] 17 196 132
To remove the last element, use head:
head(as.numeric(diff(sampledf$date)), -1)
# [1] 17 196
require(plyr)
ddply(sampledf, .(cust), summarise, daysBetween = as.numeric(diff(date)))
# cust daysBetween
# 1 1 17
# 2 1 196
# 3 1 132
You can just use diff.
as.numeric(diff(sampledf$date))
To leave off the last element, you can do:
vec[-length(vec)] #where `vec` is your vector
In this case I don't think you need to leave anything off though, because diff is already one element shorter:
test <- ddply(sampledf,
c("cust"),
summarize,
# use the per-group column `date` (not sampledf$date) so each cust is differenced separately
daysBetween = as.numeric(diff(date))
)
test
# cust daysBetween
#1 1 17
#2 1 196
#3 1 132
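For the per-customer case without plyr, a possible base R sketch is to split the dates by customer and apply diff to each piece. The second customer below is invented purely to show the grouping; the first customer uses the dates from the question.

```r
# Customer 1 is from the question; customer 2 is a made-up addition
sampledf <- data.frame(
  cust = c(1L, 1L, 1L, 1L, 2L, 2L),
  date = as.Date(c(9862, 9879, 10075, 10207, 10000, 10010),
                 origin = "1970-01-01")
)

# One vector of day gaps per customer; diff() on Dates returns days
gaps <- lapply(split(sampledf$date, sampledf$cust),
               function(d) as.numeric(diff(d)))
print(gaps)
```

This assumes the dates are already sorted within each customer; if not, sort the data frame by cust and date first.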
I am trying to produce a series of box plots in R that is grouped by 2 factors. I've managed to make the plot, but I cannot get the boxes to order in the correct direction.
The data frame I am using looks like this:
Nitrogen Species Treatment
2 G L
3 R M
4 G H
4 B L
2 B M
1 G H
I tried:
boxplot(mydata$Nitrogen~mydata$Species*mydata$Treatment)
This ordered the boxes alphabetically (the first three were the "High" treatments, and within those three they were ordered alphabetically by species name).
I want the box plot ordered Low>Medium>High then within each of those groups G>R>B for the species.
So I tried using a factor in the formula:
f = ordered(interaction(mydata$Treatment, mydata$Species),
levels = c("L.G","L.R","L.B","M.G","M.R","M.B","H.G","H.R","H.B"))
then:
boxplot(mydata$Nitrogen~f)
However, the boxes are still showing up in the same order. The labels are now different, but the boxes have not moved.
I have pulled out each set of data and plotted them all together individually:
lg = mydata[mydata$Treatment=="L" & mydata$Species=="G", "Nitrogen"]
mg = mydata[mydata$Treatment=="M" & mydata$Species=="G", "Nitrogen"]
hg = mydata[mydata$Treatment=="H" & mydata$Species=="G", "Nitrogen"]
etc ..
boxplot(lg, lr, lb, mg, mr, mb, hg, hr, hb)
This gives what I want, but I would prefer to do this in a more elegant way, so I don't have to pull each one out individually for larger data sets.
Loadable data:
mydata <-
structure(list(Nitrogen = c(2L, 3L, 4L, 4L, 2L, 1L), Species = structure(c(2L,
3L, 2L, 1L, 1L, 2L), .Label = c("B", "G", "R"), class = "factor"),
Treatment = structure(c(2L, 3L, 1L, 2L, 3L, 1L), .Label = c("H",
"L", "M"), class = "factor")), .Names = c("Nitrogen", "Species",
"Treatment"), class = "data.frame", row.names = c(NA, -6L))
The following commands will create the ordering you need by rebuilding the Treatment and Species factors, with explicit manual ordering of the levels:
mydata$Treatment = factor(mydata$Treatment,c("L","M","H"))
mydata$Species = factor(mydata$Species,c("G","R","B"))
edit 1 : oops I had set it to HML instead of LMH. fixing.
edit 2 : what factor(X,Y) does:
If you run factor(X,Y) on an existing factor, it uses the ordering of the values in Y to enumerate the values present in the factor X. Here's some examples with your data.
> mydata$Treatment
[1] L M H L M H
Levels: H L M
> as.integer(mydata$Treatment)
[1] 2 3 1 2 3 1
> factor(mydata$Treatment,c("L","M","H"))
[1] L M H L M H <-- not changed
Levels: L M H <-- changed
> as.integer(factor(mydata$Treatment,c("L","M","H")))
[1] 1 2 3 1 2 3 <-- changed
It does NOT change what the factor looks like at first glance, but it does change how the data is stored.
What's important here is that many plot functions will plot the lowest enumeration leftmost, followed by the next, etc.
If you create factors simply using factor(X), then usually the enumeration is based upon the alphabetical order of the factor levels (e.g. "H","L","M"). If your labels have a conventional ordering different from alphabetical (e.g. "H","M","L"), this can make your graphs seem strange.
At first glance, it may seem like the problem is due to the ordering of data in the data frame - i.e. if only we could place all "H" at the top and "L" at the bottom, then it would work. It doesn't. But if you want your labels to appear in the same order as the first occurrence in the data, you can use this form:
mydata$Treatment = factor(mydata$Treatment, unique(mydata$Treatment))
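One way to check the box order before plotting is to look at the levels of the interaction of the two releveled factors. A minimal sketch, with the data retyped from the table in the question and box_order as a made-up name:

```r
# Data retyped from the question
mydata <- data.frame(
  Nitrogen = c(2, 3, 4, 4, 2, 1),
  Species = c("G", "R", "G", "B", "B", "G"),
  Treatment = c("L", "M", "H", "L", "M", "H")
)

# Rebuild both factors with the level order wanted on the plot
mydata$Treatment <- factor(mydata$Treatment, levels = c("L", "M", "H"))
mydata$Species   <- factor(mydata$Species, levels = c("G", "R", "B"))

# Species varies fastest, Treatment slowest
box_order <- levels(interaction(mydata$Species, mydata$Treatment))
print(box_order)
```

Calling boxplot(Nitrogen ~ Species * Treatment, data = mydata) afterwards should draw the boxes in that same order.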
This earlier StackOverflow question shows how to reorder a boxplot based on a numerical value; what you need here is probably just a switch from factor to the related type ordered. But it is hard to say, as we do not have your data and you didn't provide a reproducible example.
Edit Using the dataset you posted in variable md and relying on the solution I pointed to earlier, we get
R> md$Species <- ordered(md$Species, levels=c("G", "R", "B"))
R> md$Treatment <- ordered(md$Treatment, levels=c("L", "M", "H"))
R> with(md, boxplot(Nitrogen ~ Species * Treatment))
which creates the chart you were looking to create.
This is also equivalent to the other solution presented here.