I have an array object containing 3 columns and a ton of rows.
Example nonsensical data to show the format:
Name | Owner | Price
chair | roger | 50
table | roger | 150
sofa | bill | 500
I want to use the lm function to get stats about the Price column. My problem is, my formula needs to compare the current value to the previous value, skipping the very first row completely.
Right now I have
lm(My_Function(Price, 5)~., data=myArray)
This allows me to do whatever logic I need with the price values. But I need to get the Price, and also the price of the previous row, in My_Function, to allow for some comparison logic.
How could I do that?
My code should look sort of like this
lm(My_Function(Price, previousPrice, 5)~., data=myArray)
So I need two things:
1. How to get the previous value (or any other arbitrary index's value, in relation to the current one) during the logic.
2. How to skip the very first row, without losing its data of course (since it will be the "previous" data for the next row).
Here's code which implements Robert Tan's suggestion:
# Make example data
X = data.frame("Price" = rnorm(10),
               "Owner" = sample(c("roger", "bill"), 10, replace = TRUE))
# Lag the price variable
library(Hmisc)
X$previousPrice = Lag(X$Price, shift = 1) #shift gives number of lags
X #Note first value for previousPrice is NA
# Run linear model. Note the first row will be ignored from the model as the "lagging" generates an NA
f = lm(Price ~ previousPrice, data = X)
summary(f)
Note that this approach solves both of your questions: (1) is addressed by Lag(); (2) happens automatically, because lm() omits the first row when it sees the NA in X$previousPrice.
If the above approach doesn't solve your problems and you still need to explicitly call My_Function() on an object with the first row removed, you could do the following:
My_Function = function(x1, x2) {x1 - x2} #Just for illustration
X2 = X[complete.cases(X), ] #make copy of X with first row removed (NB you could use `X[-1, ]` but complete.cases() will remove *all* rows with NAs)
lm(My_Function(X2$Price, X2$previousPrice) ~ ., data = X2)
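If you'd rather not add the Hmisc dependency just for the lag, the same shifted column can be built in base R by indexing; a minimal sketch of that alternative:

```r
# Base-R alternative to Hmisc::Lag(): shift the column down by one position
set.seed(1)
X <- data.frame(Price = rnorm(10))
X$previousPrice <- c(NA, X$Price[-nrow(X)])  # first row becomes NA

# lm() drops the incomplete row automatically (na.action defaults to na.omit)
f <- lm(Price ~ previousPrice, data = X)
nobs(f)  # 9: the first row was skipped
```

dplyr::lag() does the same shift if you already use the tidyverse.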
Related
I have a data frame that looks like this:
df <- data.frame(Set = c("A","A","A","B","B","B","B"), Values=c(1,1,2,1,1,2,2))
I want to collapse the data frame so I have one row for A and one for B. I want the Values column for those two rows to reflect the most common Values from the whole dataset.
I could do this as described here (How to find the statistical mode?), but notably when there's a tie (two values that each occur once, therefore no "true" mode) it simply takes the first value.
I'd prefer to use my own hierarchy to determine which value is selected in the case of a tie.
Create a data frame that defines the hierarchy, and assigns each possibility a numeric score.
hi <- data.frame(Poss = unique(df$Set), Nums =c(105,104))
In this case, A gets a numerical value of 105, B gets a numerical score of 104 (so A would be preferred over B in the case of a tie).
Join the hierarchy to the original data frame.
require(dplyr)
matched <- left_join(df, hi, by = c("Set"="Poss"))
Then, add a frequency column to your original data frame that lists the number of times each unique Set-Values combination occurs (setDT() is from data.table):
library(data.table)
setDT(matched)[, freq := .N, by = c("Set", "Values")]
Now that those frequencies have been recorded, we only need one row of each Set-Values combo, so get rid of the rest.
multiplied <- distinct(matched, Set, Values, .keep_all = TRUE)
Now, multiply frequency by the numeric scores.
multiplied$mult <- multiplied$Nums * multiplied$freq
Lastly, sort by Set first (ascending), then mult (descending), and use distinct() to take the row with the highest numerical score within each Set.
check <- multiplied[with(multiplied, order(Set, -mult)), ]
final <- distinct(check, Set, .keep_all = TRUE)
This works because multiple instances of B (numerical score = 104) will be added together (3 instances would give B a total score in the mult column of 312) but whenever A and B occur at the same frequency, A will win out (105 > 104, 210 > 208, etc.).
If using different numeric scores than the ones provided here, make sure they are spaced out enough for the dataset at hand. For example, using 2 for A and 1 for B doesn't work, because it would take 3 instances of B to trump A, instead of only 2. Likewise, if you anticipate large differences in the frequencies of A and B, use 1005 and 1004, since with the scores I used above A will eventually overtake B (200 * 104 is less than 199 * 105).
I am creating some pre-load files that need to be cleaned, by ensuring that the sum of the two count columns equals the total column. The data entry was done manually by RAs, and therefore the data is prone to error. My problem is ascertaining that the data is clean and, if there is an error, finding the easiest way to identify the rows that don't add up by returning their ID numbers.
This is my data
df1 <- data.frame(
id = c(1,2,3,4,5,6,7),
male = c(2,4,2,6,3,4,5),
female = c(3,6,4,9,2,4,1),
Total = c(5,10,7,15,6,8,7)
)
The code I am looking for is supposed to check whether male + female = Total in each row, and ONLY return an error where there is disagreement. In my data above, I would expect an error saying that the sums of male and female in the 3 rows with IDs 3, 5 and 7 are not equal to the total.
You could also do something fancier, like this one-liner:
df1$id[apply(df1[c('male','female')], 1, sum) != df1$Total]
which will give you just the ids (Aziz's answer works great too)
You can use:
mismatch_rows = which(df1$male + df1$female != df1$Total)
To get the indices of the rows that don't match. If you want the actual values, you can simply use:
df1[mismatch_rows,]
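To get an actual error message listing the offending IDs, as the question asks, the check can be wrapped in a small function that calls stop(); a sketch (check_totals is a made-up name):

```r
check_totals <- function(d) {
  # rowSums() is a vectorized shortcut for apply(d[cols], 1, sum)
  bad <- d$id[rowSums(d[c("male", "female")]) != d$Total]
  if (length(bad) > 0)
    stop("sum of male and female is not equal to Total for id(s): ",
         paste(bad, collapse = ", "))
  invisible(TRUE)
}

df1 <- data.frame(id     = c(1,2,3,4,5,6,7),
                  male   = c(2,4,2,6,3,4,5),
                  female = c(3,6,4,9,2,4,1),
                  Total  = c(5,10,7,15,6,8,7))

try(check_totals(df1))  # Error: ... for id(s): 3, 5, 7
```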
I am trying to do a couple things here in R. I have a large dataset. I need to find the mean of a column SI.x, which I have done, then break up the data and find the SI.x mean for each of the subsets, which I have also done.
But then I need to subtract the total mean SI.x (which I've called meangen0a as it's the mean of the generation I'm looking at) from each of the subsetted means. I'd like a way to save the subsetted means as a vector, subtract meangen0a from each of these, and save the result as another vector, as I will need to do more vector math later.
Here's what I've done so far:
I got the mean SI.x of the generation I'm looking at (which I called gen0a):
meangen0a <- mean(gen0a$SI.x)
This worked fine.
I split up the generation by treatment (a control and four others) and only used those that were selected for (which was designated by a 1 in the Select column).
gen0ameans <- with(gen0a[gen0a$Select == 1,], aggregate(SI.x, by=list(Generation, SelectTreatment), mean))
colnames(gen0ameans) <- c("Generation", "Treatment", "S")
This gave me a table with the generation (all 0a), the five treatments, and what their respective SI.x means were. This is what I wanted.
Now I want to subtract the total mean meangen0a from each of the five treatment means in the gen0ameans table. I tried doing this:
S0a <- lapply(gen0ameans$S, FUN=function(S) S-meangen0a)
and it gave me the correct numbers, but not in vector format. I need it to be in a vector of some sort because I will later need to subset the next generation and subtract 0a's means from the next generation's. When I tried to save S0a as a vector or matrix, it wasn't giving me a single row or column of the means like I'd like.
Any help would be appreciated. Thanks!
Edit - The mean of gen0a is -0.07267818.
The gen0ameans table looks like:
  Generation Treatment           S
  0a         Control   -0.07205068
  0a         Down1     -0.08288528
  0a         Down2     -0.08146745
  0a         Up1       -0.06296805
  0a         Up2       -0.06401943
When doing the S0a command from #3 above, it gives me:
[[1]]
[1] 0.0006274983
[[2]]
[1] -0.0102071
[[3]]
[1] -0.008789275
[[4]]
[1] 0.009710126
[[5]]
[1] 0.008658747
We can do this in tidyverse
library(tidyverse)
gen0a %>%
  mutate(Meanval = mean(SI.x)) %>%
  filter(Select == 1) %>%
  group_by(Generation, SelectTreatment) %>%
  mutate(NewMean = mean(SI.x) - Meanval)
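Alternatively, note that subtraction is vectorized in R, so lapply() isn't needed at all: subtracting the scalar mean directly from the S column already yields the plain numeric vector the asker wants. A sketch using the numbers quoted in the question:

```r
meangen0a <- -0.07267818
gen0ameans <- data.frame(
  Treatment = c("Control", "Down1", "Down2", "Up1", "Up2"),
  S = c(-0.07205068, -0.08288528, -0.08146745, -0.06296805, -0.06401943)
)

S0a <- gen0ameans$S - meangen0a     # vectorized subtraction -> numeric vector
names(S0a) <- gen0ameans$Treatment  # optional: label each element
S0a
```

(unlist() on the existing lapply() output, or swapping lapply() for sapply(), would also collapse the list into a vector.)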
This is probably simple, but I'm new to R and it doesn't work like GrADS, so I've been searching high and low for examples, but to no avail.
I have two sets of data. Data A (1997) and Data B (2000)
Data A has 35 headings (apples, orange, grape etc). 200 observations.
Data B has 35 headings (apples, orange, grape, etc). 200 observations.
The only difference between the two datasets is the year.
So I would like to correlate the two datasets, i.e. the 200 values under Apples (1997) vs the 200 values under Apples (2000), so that each heading gives me only 1 value.
I've converted all the header names to V1,V2,V3...
So now I need to do this:
x <- 1
while (x < 35) {
  new(x) = cor(1997$V(x), 2000$V(x))
  print(new(x))
}
and then i get this error:
Error in pptn26$V(x) : attempt to apply non-function.
Any advise is highly appreciated!
Your error comes directly from using parentheses where R isn't expecting them. You'll get the same type of error if you do 1(x). 1 is not a function, so if you put it right next to parentheses with no white space between, you're attempting to apply a non function.
I'm also a bit surprised at how you are managing to get all the way to that error, before running into several others, but I suppose that has something to do with when R evaluates what...
Here's how to get the behavior you're looking for:
mapply(cor, A, B)
# provided A is the name of your 1997 data frame and B the 2000
Here's an example with simulated data:
set.seed(123)
A <- data.frame(x = 1:10, y = sample(10), z = rnorm(10))
B <- data.frame(x = 4:13, y = sample(10), z = rnorm(10))
mapply(cor, A, B)
# x y z
# 1.0000000 0.1393939 -0.2402058
In its typical usage, mapply takes an n-ary function and n objects that provide the n arguments for that function. Here the n-ary function is cor, and the objects are A, and B, each a data frame. A data frame is structured as a list of vectors, the columns of the data frame. So mapply will loop along your columns for you, making 35 calls to cor, each time with the next column of both A and B.
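If you do want the explicit loop from the question, the fix is to index columns with [[i]] (double brackets) rather than the V(x) call that triggered the error; a sketch over the same simulated A and B:

```r
set.seed(123)
A <- data.frame(x = 1:10, y = sample(10), z = rnorm(10))
B <- data.frame(x = 4:13, y = sample(10), z = rnorm(10))

# A[[i]] extracts the i-th column as a vector; seq_along(A) covers all columns
cors <- sapply(seq_along(A), function(i) cor(A[[i]], B[[i]]))
names(cors) <- names(A)
cors  # same values as mapply(cor, A, B)
```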
If you have managed to figure out how to name your data frames 1997 and 2000, kudos. It's not easy to do that. It's also going to cause you headaches. You'll want to have a syntactically valid name for your data frame(s). That means they should start with a letter (or a dot, but really a letter). See the R FAQ for the details.
So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),  # list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last one carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors,
4. so I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
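For completeness, the same last-observation-carried-forward step can also be done without zoo, using tidyr's fill(), which replaces each NA with the most recent non-NA value above it; a sketch assuming the same dat as above:

```r
library(dplyr)
library(tidyr)

dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))

dat %>%
  mutate(Current_Event = log_message) %>%
  fill(Current_Event)  # default .direction = "down": carry values forward
```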