I currently have a table that looks like this:
Date Variable Value
1995-10-01 X 50
1995-10-01 Y 60
1995-08-03 X 70
1995-08-03 Y 90
And want to reshape it so that it looks like this:
Date X Y
1995-10-01 50 60
1995-08-03 70 90
This is easily doable in R using the cast function from the reshape package with the command df <- cast(df, ... ~ variable). I have two questions:
1) Can this form of dataset modification be done using a calculated field with an R script?
2) Is there a native way for such modification to be done in Tableau?
Any help would be much appreciated.
This is your data as it would be set up by default:
Step 1 Image
All you need to do is move the Variable field up to the Columns section:
Step 2 Image
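For the R side of the question, here is a minimal sketch of the reshape itself (reshape::cast as the question mentions, plus the newer tidyr equivalent; column names are assumed to match the example data above):

library(reshape)   # provides cast(), as used in the question
df <- data.frame(Date = c("1995-10-01", "1995-10-01", "1995-08-03", "1995-08-03"),
                 Variable = c("X", "Y", "X", "Y"),
                 Value = c(50, 60, 70, 90))

# reshape long -> wide: one column per level of Variable
wide <- cast(df, Date ~ Variable, value = "Value")

# the same reshape with tidyr
library(tidyr)
wide2 <- pivot_wider(df, names_from = Variable, values_from = Value)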
I have a data frame with 25 weeks of observations per animal and 20 animals in total. I am trying to write a function that calculates a linear equation between 2 points at a time and does that for all 25 weeks and all 20 animals.
I want to use a general form of the equation so I can calculate values at any point. In the function, Week = t and Weight = d.
I can't figure out how to make this work. I don't think the loop is using each row of the data frame as the index for the function. My data frame, named growth, looks something like this:
Week Weight Animal
   1     50      1
   2     60      1
 ...    ...    ...   (n = 25 weeks per animal)
   1     80      2
   2     90      2
 ...    ...    ...   (up to animal 20)
for (i in growth$Week){
  eq <- function(t){
    d = growth$BW.Kg
    t = growth$Week
    (d[i+1] - d[i]) / (t[i+1] - t[i]) * (t - t[i]) + d[i]
    return(eq)
  }
}
eq(3)
OK, so I think there are a few points of confusion here. The first is defining a function inside a for loop: what is happening is that you are re-writing the function over and over, and your function doesn't save the values of your equation anywhere. Secondly, you are passing t as your argument but then expecting t to follow the for loop through the i value. Finally, you say that you want this to be done for each animal, but the animal value is not used anywhere in your code.
So it's a little bit hard to see what you are trying to achieve here.
Based on your information above, I've rewritten your function into something that will provide a result for your equation.
library(tidyverse)

growth <- tibble(week   = 1:5,
                 animal = 1,
                 weight = c(50, 52, 55, 54, 57))

eq <- function(d, t, i){
  z <- (d[i+1] - d[i]) / (t[i+1] - t[i]) * (t - t[i]) + d[i]
  return(z)
}

test_result <- eq(growth$weight, growth$week, 3)
Results:
[1] 57 56 55 54 53
Is that the kind of result you were expecting? Or did you want just a single result per week per animal? Could you provide a working example of a formula that would produce a single desired result (i.e. a result for animal 1 on week 1)?
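If the goal is instead a single interpolated value per animal at an arbitrary week, one possible direction (just a sketch, assuming data shaped like the growth tibble above but with a second animal added) is to group by animal and let approx() do the two-point linear interpolation:

library(tidyverse)

growth2 <- tibble(week   = rep(1:5, times = 2),
                  animal = rep(1:2, each = 5),
                  weight = c(50, 52, 55, 54, 57,
                             80, 82, 85, 88, 90))

# linearly interpolated weight at week 3.5, one value per animal
growth2 %>%
  group_by(animal) %>%
  summarize(weight_at_3.5 = approx(week, weight, xout = 3.5)$y)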
Here is an example of my dataset. I want to calculate a binned average based on time (i.e., ts) every 10 seconds. Could you please provide some hints so that I can carry on?
In my case, I want to average ts and Var over every 10 seconds. For example, I would get one averaged value of Var and ts from 0 to 10 seconds, another averaged value from 11 to 20 seconds, and so on.
df = data.frame(ts = seq(1,100,by=0.5), Var = runif(199,1, 10))
Are there any functions or libraries in R I can use for this task?
There are many ways to calculate a binned average: with base aggregate or by, with the packages dplyr or data.table, probably with zoo, and surely with other time-series packages...
library(dplyr)

df %>%
  group_by(interval = round(ts / 10) * 10) %>%
  summarize(Var_mean = mean(Var))
# A tibble: 11 x 2
interval Var_mean
<dbl> <dbl>
1 0 4.561653
2 10 6.544980
3 20 6.110336
4 30 4.288523
5 40 5.339249
6 50 6.811147
7 60 6.180795
8 70 4.920476
9 80 5.486937
10 90 5.284871
11 100 5.917074
That's the dplyr approach; note how both it and data.table let you name the intermediate variables, which keeps the code clean and legible.
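For comparison, here is a sketch of the same nearest-10 binning written in data.table (assuming the same df as in the question), with the grouping variable named in exactly the same way:

library(data.table)
dt <- as.data.table(df)

# bin ts to the nearest multiple of 10, then average Var within each bin
dt[, .(Var_mean = mean(Var)), by = .(interval = round(ts / 10) * 10)]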
Assuming df in the question, convert to a zoo object and then aggregate.
The second argument of aggregate.zoo is a vector the same length as the time vector giving the new times that each original time is to be mapped to. The third argument is applied to all time series values whose times have been mapped to the same value. This mapping could be done in various ways but here we have chosen to map times (0, 10] to 10, (10, 20] to 20, etc. by using 10 * ceiling(time(z) / 10).
In light of some of the other comments in the answers, let me point out that, in contrast to using a data frame, there is significant simplification here. Firstly, the data has been reduced to one dimension (vs. 2 in a data.frame). Secondly, it is more conducive to a whole-object approach, whereas with data frames one needs to continually pick apart the object and work on those parts. Thirdly, one now has all the facilities of zoo to manipulate the time series: numerous NA-removal schemes, rolling functions, overloaded arithmetic operators, n-way merges, simple access to classic, lattice and ggplot2 graphics, a design that emphasizes consistency with base R (making it easy to learn), extensive documentation including 5 vignettes plus help files with numerous examples, and likely very few bugs given the 14 years of development and widespread use.
library(zoo)
z <- read.zoo(df)
z10 <- aggregate(z, 10 * ceiling(time(z) / 10), mean)
giving:
> z10
10 20 30 40 50 60 70 80
5.629926 6.571754 5.519487 5.641534 5.309415 5.793066 4.890348 5.509859
90 100
4.539044 5.480596
(Note that the data in the question is not reproducible because it used random numbers without set.seed so if you try to repeat the above you won't get an identical answer.)
Now we could plot it, say, using any of these:
plot(z10)
library(lattice)
xyplot(z10)
library(ggplot2)
autoplot(z10)
In general, I agree with @smci: the dplyr and data.table approaches are the best here. Let me elaborate a bit further.
# the dplyr way
library(dplyr)
df %>%
  group_by(interval = ceiling(seq_along(ts) / 20)) %>%
  summarize(variable_mean = mean(Var))

# the data.table way
library(data.table)
dt <- data.table(df)
dt[, list(Var_mean = mean(Var)),
   by = list(interval = ceiling(seq_along(dt$ts) / 20))]
I would not go to the traditional time-series solutions like ts, zoo or xts here. Their methods are more suitable for regular frequencies such as monthly or quarterly data. Apart from ts, they can also handle irregular frequencies and high-frequency data, but many methods, such as the print methods, don't work well or at least do not give you an advantage over data.table or data.frame.
As long as you're just aggregating and grouping, both data.table and dplyr are also likely faster in terms of performance. My guess is that data.table has the edge over dplyr in terms of speed, but you would have to benchmark/profile that, e.g. using microbenchmark. So if you're not working with a classic R time-series format anyway, there's no reason to switch to one just for aggregating.
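If you want to check that speed claim rather than take my word for it, a minimal benchmarking sketch (assuming the df from the question and the two groupings above) could look like this:

library(microbenchmark)
library(dplyr)
library(data.table)

dt <- data.table(df)

microbenchmark(
  dplyr = df %>%
    group_by(interval = ceiling(seq_along(ts) / 20)) %>%
    summarize(variable_mean = mean(Var)),
  data.table = dt[, list(Var_mean = mean(Var)),
                  by = list(interval = ceiling(seq_along(dt$ts) / 20))],
  times = 100
)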
I'm working on a rather lengthy shared R program which processes client data and references things like the names of the time variables supplied by each client (which obviously change at almost every client submission).
What I want to do is set the name of (say) a timeseries variable to WEEK and then be able to reference timeseries throughout the code, so that I only need to change the one section of code right at the top:
TOP OF CODE
timeseries <- "WEEK"
EXAMPLE MID CODE
summary_transposed_no_time = summary_transposed_no_missing
summary_transposed_no_time$timeseries <- NULL
I have found that this approach does work for things like sqldf steps, as the code below works just fine. Ideally I want to use this approach across both the R logic and the SQL logic, since the program is very lengthy and a lot of it is written in SQL, which I would love to avoid re-writing:
dataset <- "client_a_data"
response <- "SALE"
timeseries <- "WEEK"
region <- "POSTAL_DIST"
summary <- sqldf(paste("SELECT", timeseries,
                       ",", region,
                       ", sum(", response, ") AS", response,
                       "FROM", "dataset",
                       "GROUP BY", timeseries, ",", region,
                       "ORDER BY", timeseries, ",", region))
I think I see what you're trying to achieve, but let me know if I'm off track...
One way I can see to do this would be to build a search for the appropriate column early in your script, and use the returned value from then on to refer to the column.
df <- data.frame( data = rnorm( 20, 1, 1 ), day = seq_len( 20 ) )
df$week <- ((df$day - 1) %/% 7) + 1
Now we can specify your timeseries variable as any of the columns in the frame:
timeseries <- "week"
Then, somewhere in our script, have something like this to extract a reference for the column:
timeColumn <- match( timeseries, names( df ) )
Which now allows you to refer to that column as many times as you like in your script:
df[, timeColumn]
Any time you change that "week" value to, say, "day", the rest of your script will then refer to that column instead.
Just a note: if you do go this route, be careful either not to move columns around (which would make your reference value stop working correctly) or to have the match call run each time you want to refer to the column (which would allow you to move columns around if you need to).
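As a sketch of that second option (a hypothetical helper, not something from the question's code), you can wrap the lookup so that match() runs on every access:

# hypothetical helper: look the column up by name on every call,
# so reordering the columns of df cannot break the reference
get_col <- function(d, name) {
  d[[match(name, names(d))]]
}

get_col(df, timeseries)   # same values as df[, timeColumn], but robust to column order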
You can refer to any column by name directly. Variables response, timeseries and region are as defined in the question.
# generate some data
client_a_data <- data.frame(SALE=100:104, WEEK=1:5, POSTAL_DIST=60000:60004)
# read in data
dataset <- ... # whatever code you use to upload the client_a_data
# here:
dataset <- "client_a_data"
dataset <- get(dataset)
dataset
SALE WEEK POSTAL_DIST
1 100 1 60000
2 101 2 60001
3 102 3 60002
4 103 4 60003
5 104 5 60004
# refer to any column by its pre-defined name
dataset[, timeseries]
[1] 1 2 3 4 5
dataset[, c(response, region)]
SALE POSTAL_DIST
1 100 60000
2 101 60001
3 102 60002
4 103 60003
5 104 60004
So your specific line that would delete the WEEK column should read:
summary_transposed_no_time[, timeseries] <- NULL
Or you might wish to rename the relevant columns at the beginning of your code to whatever names appear throughout it.
colnames(dataset)[match(c(timeseries, response, region), colnames(dataset))] <- c("timeseries", "response", "region")
This seems like it should be simple, but I have been struggling for a while to solve it. I am trying to extract the value of variable Z given the values of two categorical variables X and Y.
BUT, I want to do this for all combinations of X and Y.
So, for any given pair of values this is easy: I can get Z by using the following code (assume the data frame is called df):
df[df$X == 1 & df$Y == 2, ]$Z
But, I would like to use this to build a cross-reference table.
The following example will make this easy to understand.
Here's a simplified version of the data frame as it comes in:
Person ID Question Number Response
1 10 YES
1 20 NO
1 30 YES
2 10 YES
2 20 MAYBE
2 30 YES
3 10 YES
3 30 NO
4 20 NO
4 30 MAYBE
I want to be able to take this data and make a cross-reference data.frame, like so:
[row names are the levels of "Person ID" and col names are the levels of "Question Number"]
[10] [20] [30]
[1] YES NO YES
[2] YES MAYBE YES
[3] YES N/A NO
[4] N/A NO MAYBE
I have tried the "table" function, but it gives me summary statistics (frequency counts). So, if I use the following:
table(df$Person.Id, df$Question.Num)
I get the right row and column headings, but the values are frequency counts. Since this is a cross-reference table, I need that to be the value for df$Response instead of the frequency count.
As I said before, I can manually find every value of df$Response using the following code
df[df$Person.Id == "1" & df$Question.Num == "20", ]$Response
But, I cannot manage to stitch this together into a data.frame. I tried to use nested for loops, but couldn't get it to work. I could get all the values out, but found no way to stitch everything into a cross-reference table as described above.
Just a background note: this is a necessary preparatory step so I can fit a logit linear model.
Based on the suggestion by Metrics, I did the following:
install.packages("reshape")
library("reshape")
cast(df, Person.Id~Question.Num, value = "Response")
That last part is key. The value = "Response" tells the cast function what variable to use to fill in the table.
This is a fantastic package. You can find more information on it here:
http://www.statmethods.net/management/reshape.html
and, the original paper published in the Journal of Statistical Software:
http://www.jstatsoft.org/v21/i12/paper
Thanks for the tips!
I never tried the tidyr package, because the reshape package worked so well. Perhaps, it is just as easy with that package. I leave it to the community to figure that out.
Thanks!
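For anyone who does want to try the tidyr route, here is a minimal sketch (assuming the column names Person.Id, Question.Num and Response used above, and a current tidyr with pivot_wider; older versions used spread):

library(tidyr)

# one row per person, one column per question, filled with Response
xref <- pivot_wider(df,
                    id_cols     = Person.Id,
                    names_from  = Question.Num,
                    values_from = Response)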
I have an input file like this
V1 V2 V3 V4.............V60
11 22 33 44.............89
21 98 22 33.............09
33 44 55 78.............20
The above file has more than 3000 rows with 60 columns in each row.
When I try using density(data, kernel="gaussian", bw=15) at my R prompt, it generates an error saying
Error in density.default(data) : argument 'x' must be numeric
But, when I try density(data$V1, kernel="gaussian", bw=15), it works fine.
I was wondering if there is a single command to calculate the density of the entire file instead of doing it separately for every single column, 60 times.
You might be looking for sapply or apply.
You can use
apply(myDataName, 2, density, kernel = "gaussian", bw = 15)
If your columns are factors instead of numeric, you will need to convert those first.
Most likely your data object is a data frame (this is the default when reading data using tools like read.table and read.csv).
If you want to process each column (create a separate density plot for each column), then you can use the lapply function.
If you want one single density based on all the data (the columns don't mean anything), then you can use the unlist function to convert it all to one big vector. Better may be to use the scan function instead of read.table to load the data into a vector to begin with and skip the data frame altogether.
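A sketch of both approaches (assuming the object is called data, as in the question, and all 60 columns are numeric):

# one density object per column
dens_list <- lapply(data, density, kernel = "gaussian", bw = 15)
plot(dens_list$V1)   # inspect the density of the first column, for example

# one single density over all values pooled into one vector
dens_all <- density(unlist(data), kernel = "gaussian", bw = 15)
plot(dens_all)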