As part of a project, I am using R to analyze some data. I am stuck retrieving a few values from an existing dataset that I imported from a CSV file.
The file looks like:
For my analysis, I want to create another column that holds the difference between the current value of x and its previous value. For the first row of every unique i, the new column should simply hold the current value of x. I am new to R and have been trying various approaches for some time, but I have not been able to figure out a way to do this. Any suggestions on an approach I could follow would be appreciated.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=TRUE), i=rep(1:2, each=5))
You can use the diff() function. Note that diff() returns a vector one element shorter than its input, so to add the result as a new column to your existing data frame you need to pad it. In your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That puts an NA as the first entry of your new column; the remaining values are the differences between consecutive values in your "x" column.
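To make the length mismatch concrete, here is a minimal sketch on a small made-up vector (the values are hypothetical, not taken from the question's data). It shows that diff() drops one element and that padding with NA restores the original length:

```r
# Hypothetical values, not the data from the question
x <- c(3, 8, 10, 15)

d <- diff(x)          # differences between consecutive elements
length(d)             # one less than length(x)

padded <- c(NA, d)    # pad the front so the result lines up with x
length(padded) == length(x)
```

The padded vector can then be assigned as a data frame column, because its length matches the number of rows.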
UPDATE:
You can write a simple loop that subsets on every unique value of "i" and then calculates the differences between the x values within that subset:
# initialize a new dataframe
newdf = NULL
values = unique(MyData$i)
for(i in seq_along(values)){
  # rows belonging to the current group (== compares, = would assign)
  data1 = MyData[MyData$i == values[i],]
  data1$newX = c(NA,diff(data1$x))
  # append the processed group to the result
  newdf = rbind(newdf,data1)
}
# and then if you want to overwrite your original dataframe with newdf
MyData = newdf
# remove some variables
rm(data1,newdf,values)
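The per-group difference in the loop above can also be sketched without a loop using base R's ave(), which applies a function within each group and returns a vector aligned with the original rows. The data frame below is made up for illustration, and the first row of each group keeps its own x value, as the question asks:

```r
# Hypothetical data with two groups in column i
MyData <- data.frame(
  t = 1:8,
  x = c(10L, 12L, 15L, 20L, 26L, 7L, 9L, 14L),
  i = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L)
)

# Within each group of i: first value as-is, then consecutive differences
MyData$x_diff <- ave(MyData$x, MyData$i,
                     FUN = function(v) c(v[1], diff(v)))
MyData
```

Replacing v[1] with NA inside the function would give the NA-padded variant instead.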
I would like to sum a single column of data that was output from an sqldf function in R.
I have a CSV file that contains groupings of sites with a unique ID and their associated areas. For example:
occurrenceID sarea
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.30626786
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.49235953
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.03490536
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.00001389
{175A4B1C-CA8C-49F6-9CD6-CED9187579DC} 0.0302389
{175A4B1C-CA8C-49F6-9CD6-CED9187579DC} 0.01360811
{1EC60400-0AD0-4DB5-B815-221C4123AE7F} 0.08412911
{1EC60400-0AD0-4DB5-B815-221C4123AE7F} 0.01852466
I used the code below in R to pull out the largest area from each grouping of unique IDs.
> MyData <- read.csv(file="sacandaga2.csv", header=TRUE, sep=",")
> sqldf("select max(sarea),occurrenceID from MyData group by occurrenceID")
This produced the following output:
max(sarea) occurrenceID
1 0.49235953 {0255531B-904F-4E2D-B81D-797A21165A2F}
2 0.03023890 {175A4B1C-CA8C-49F6-9CD6-CED9187579DC}
3 0.08412911 {1EC60400-0AD0-4DB5-B815-221C4123AE7F}
4 0.00548259 {2412E244-2E9A-4477-ACC6-1EB02503BE75}
5 0.00295924 {40450574-ABEB-48E3-9BE5-09B5AB65B465}
6 0.01403846 {473FB631-D398-46B7-8E85-E63540BDFF92}
7 0.00257519 {4BABDE22-E8E0-435E-B60D-0BB9A84E1489}
8 0.02158115 {5F616A33-B028-46B1-AD92-89EAC1660C41}
9 0.00191211 {70067496-25B6-4337-8C70-782143909EF9}
10 0.03049355 {7F858EBB-132E-483F-BA36-80CE889373F5}
11 0.03947298 {9A579565-57EC-4E46-95ED-79724FA6F2AB}
12 0.02464722 {A9010BA3-0FE1-40B1-96A7-21122261A003}
13 0.00136672 {AAD710BF-1539-4235-87F1-34B66CF90781}
14 0.01139146 {AB1286C3-DBE3-467B-99E1-AEEF88A1B5B2}
15 0.07954269 {BED0433A-7167-4184-A25F-B9DBD358AFFB}
16 0.08401067 {C4EF0F45-5BF7-4F7C-BED8-D6B2DB718CB2}
17 0.04289261 {C58AC2C6-BDBE-4FE5-BD51-D70BBDFB4DB5}
18 0.03151558 {D4230F9C-80E4-454A-9D5D-0E373C6DCD9A}
19 0.00403585 {DD76A03A-CFBF-41E9-A571-03DA707BEBDA}
20 0.00007336 {E20DE254-8A0F-40BE-90D2-D6B71880E2A8}
21 9.81847859 {F382D5A6-F385-426B-A543-F5DE13F94564}
22 0.00815881 {F9032905-074A-468F-B60E-26371CF480BB}
23 0.24717113 {F9E5DC3C-4602-4C80-B00B-2AF1D605A265}
Now I would like to sum all the values in the max(sarea) column. What is the best way to accomplish this?
You can do it entirely in sqldf, or assign your existing result and do the sum in R:
# assign your original
grouped_sum = sqldf("select max(sarea),occurrenceID from MyData group by occurrenceID")
# and sum in R
sum(grouped_sum$`max(sarea)`)
# you might prefer to use a standard column name so you don't need backticks
grouped_sum = sqldf(
"select max(sarea) as max_sarea, occurrenceID
from MyData
group by occurrenceID"
)
sum(grouped_sum$max_sarea)
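If sqldf is not required, the same "sum of per-group maxima" can be sketched in base R with tapply(). This assumes the eight sample rows from the question; the full file would give a different total:

```r
# The eight sample rows shown in the question
MyData <- data.frame(
  occurrenceID = c(rep("{0255531B-904F-4E2D-B81D-797A21165A2F}", 4),
                   rep("{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}", 2),
                   rep("{1EC60400-0AD0-4DB5-B815-221C4123AE7F}", 2)),
  sarea = c(0.30626786, 0.49235953, 0.03490536, 0.00001389,
            0.0302389, 0.01360811, 0.08412911, 0.01852466)
)

# Largest area per occurrenceID, then the sum of those maxima
per_id_max <- tapply(MyData$sarea, MyData$occurrenceID, max)
total <- sum(per_id_max)
total
```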
If the intention is to do this in a single sqldf call, use a WITH clause:
library(sqldf)
sqldf("with tmpdat AS (
select max(sarea) as mxarea, occurrenceID
from MyData group by occurrenceID
) select sum(mxarea)
as smxarea from tmpdat")
# smxarea
#1 0.6067275
data
MyData <-
structure(list(occurrenceID = c("{0255531B-904F-4E2D-B81D-797A21165A2F}",
"{0255531B-904F-4E2D-B81D-797A21165A2F}", "{0255531B-904F-4E2D-B81D-797A21165A2F}",
"{0255531B-904F-4E2D-B81D-797A21165A2F}", "{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}",
"{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}", "{1EC60400-0AD0-4DB5-B815-221C4123AE7F}",
"{1EC60400-0AD0-4DB5-B815-221C4123AE7F}"), sarea = c(0.30626786,
0.49235953, 0.03490536, 1.389e-05, 0.0302389, 0.01360811, 0.08412911,
0.01852466)), class = "data.frame", row.names = c(NA, -8L))
You can do it by getting the sum of maximum values:
sqldf("select sum(max_sarea) as sum_of_max_sarea
from (select max(sarea) as max_sarea,
occurrenceID from Mydata group by occurrenceID)")
# sum_of_max_sarea
# 1 0.6067275
Data:
Mydata <- structure(list(occurrenceID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L),
.Label = c("0255531B-904F-4E2D-B81D-797A21165A2F", "175A4B1C-CA8C-49F6-9CD6-CED9187579DC",
"1EC60400-0AD0-4DB5-B815-221C4123AE7F"), class = "factor"),
sarea = c(0.30626786, 0.49235953, 0.03490536, 1.389e-05, 0.0302389,
0.01360811, 0.08412911, 0.01852466)), class = "data.frame",
row.names = c(NA, -8L))
If DF is the last data frame shown in the question this sums the numeric column:
sqldf("select sum([max(sarea)]) as sum from DF")
## sum
## 1 11.07853
Note
We assume this data frame shown in reproducible form:
Lines <- "max(sarea) occurrenceID
1 0.49235953 {0255531B-904F-4E2D-B81D-797A21165A2F}
2 0.03023890 {175A4B1C-CA8C-49F6-9CD6-CED9187579DC}
3 0.08412911 {1EC60400-0AD0-4DB5-B815-221C4123AE7F}
4 0.00548259 {2412E244-2E9A-4477-ACC6-1EB02503BE75}
5 0.00295924 {40450574-ABEB-48E3-9BE5-09B5AB65B465}
6 0.01403846 {473FB631-D398-46B7-8E85-E63540BDFF92}
7 0.00257519 {4BABDE22-E8E0-435E-B60D-0BB9A84E1489}
8 0.02158115 {5F616A33-B028-46B1-AD92-89EAC1660C41}
9 0.00191211 {70067496-25B6-4337-8C70-782143909EF9}
10 0.03049355 {7F858EBB-132E-483F-BA36-80CE889373F5}
11 0.03947298 {9A579565-57EC-4E46-95ED-79724FA6F2AB}
12 0.02464722 {A9010BA3-0FE1-40B1-96A7-21122261A003}
13 0.00136672 {AAD710BF-1539-4235-87F1-34B66CF90781}
14 0.01139146 {AB1286C3-DBE3-467B-99E1-AEEF88A1B5B2}
15 0.07954269 {BED0433A-7167-4184-A25F-B9DBD358AFFB}
16 0.08401067 {C4EF0F45-5BF7-4F7C-BED8-D6B2DB718CB2}
17 0.04289261 {C58AC2C6-BDBE-4FE5-BD51-D70BBDFB4DB5}
18 0.03151558 {D4230F9C-80E4-454A-9D5D-0E373C6DCD9A}
19 0.00403585 {DD76A03A-CFBF-41E9-A571-03DA707BEBDA}
20 0.00007336 {E20DE254-8A0F-40BE-90D2-D6B71880E2A8}
21 9.81847859 {F382D5A6-F385-426B-A543-F5DE13F94564}
22 0.00815881 {F9032905-074A-468F-B60E-26371CF480BB}
23 0.24717113 {F9E5DC3C-4602-4C80-B00B-2AF1D605A265}"
DF <- read.table(text = Lines, check.names = FALSE)
I have a question that I could not find answered anywhere.
Suppose that I have the following dataframe:
individual gen_check acc loss
1 nnn/nn/nn/nn 2 0.9889 0.0112
2 nnn/n/nn 2 0.7845 0.3451
3 nnn/nn/nn/nn 2 0.564 0.4231
What I want to do is to update the gen_check value of the first row when I filter by individual = "nnn/nn/nn/nn" and gen_check = 2, and I want to update the gen_check value to 3.
I've tried the following expression, but it modifies both the first and third rows, and I only want to update the first one.
fitness_calculations <- within(fitness_calculations, gen_check[individual == "nnn/nn/nn/nn" & gen_check == 2] <- 3)
We build a logical index for the condition and then use duplicated() to keep only its first TRUE:
i1 <- with(fitness_calculations, individual == "nnn/nn/nn/nn" & gen_check == 2)
i2 <- !duplicated(i1) & i1
fitness_calculations$gen_check[i2] <- 3
fitness_calculations
# individual gen_check acc loss
#1 nnn/nn/nn/nn 3 0.9889 0.0112
#2 nnn/n/nn 2 0.7845 0.3451
#3 nnn/nn/nn/nn 2 0.5640 0.4231
Another option is to wrap the index in which() and take only the first position:
i2 <- which(i1)[1]
fitness_calculations$gen_check[i2] <- 3
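A related sketch, not part of the answer above: match(TRUE, i1) also returns the position of the first matching row, and it returns NA when nothing matches, which makes it easy to guard the update explicitly. The data frame is rebuilt here so the snippet runs on its own:

```r
fitness_calculations <- data.frame(
  individual = c("nnn/nn/nn/nn", "nnn/n/nn", "nnn/nn/nn/nn"),
  gen_check  = c(2L, 2L, 2L),
  acc        = c(0.9889, 0.7845, 0.564),
  loss       = c(0.0112, 0.3451, 0.4231),
  stringsAsFactors = FALSE
)

# Logical index of all rows that satisfy the condition
i1 <- with(fitness_calculations,
           individual == "nnn/nn/nn/nn" & gen_check == 2)

# Position of the first TRUE, or NA if no row matches
first_hit <- match(TRUE, i1)

# Only update when a matching row exists
if (!is.na(first_hit)) {
  fitness_calculations$gen_check[first_hit] <- 3L
}
fitness_calculations
```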
data
fitness_calculations <- structure(list(individual = c("nnn/nn/nn/nn",
"nnn/n/nn", "nnn/nn/nn/nn"
), gen_check = c(2L, 2L, 2L), acc = c(0.9889, 0.7845, 0.564),
loss = c(0.0112, 0.3451, 0.4231)), class = "data.frame", row.names = c("1",
"2", "3"))
I am a beginner in R programming; please help me with the issue below.
I have different desc values assigned to the same sol value in different rows. I want to combine all desc values for each sol into a single row, as shown below.
My data is as follows:
sol desc
1 fry, toast
1 frt,grt,gty
1 ytr,uyt,ytr
6 hyt, ytr,oiu
4 hyg,hyu,loi
4 opu,yut,yut
I want the output as follows :
sol desc
1 fry,toast,frt,grt,gty,ytr,uyt,ytr
6 hyt, ytr,oiu
4 hyg,hyu,loi,opu,yut,yut
Note: you can input any values in desc as per your convenience.
aggregate() is what you are looking for. Try this:
aggregate(desc ~ sol, data = df, paste, collapse = ",")
sol desc
1 1 fry, toast,frt,grt,gty,ytr,uyt,ytr
2 4 hyg,hyu,loi,opu,yut,yut
3 6 hyt, ytr,oiu
Data
df <- structure(list(sol = c(1L, 1L, 1L, 6L, 4L, 4L), desc = c("fry, toast",
"frt,grt,gty", "ytr,uyt,ytr", "hyt, ytr,oiu", "hyg,hyu,loi",
"opu,yut,yut")), .Names = c("sol", "desc"), class = "data.frame", row.names = c(NA,
-6L))
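Note that aggregate() returns the groups sorted by sol (1, 4, 6), while the desired output in the question lists them in order of first appearance (1, 6, 4). If that order matters, one possible sketch uses tapply() with a factor whose levels follow the original order. The names groups, combined and result are introduced here for illustration only:

```r
df <- data.frame(
  sol  = c(1L, 1L, 1L, 6L, 4L, 4L),
  desc = c("fry, toast", "frt,grt,gty", "ytr,uyt,ytr",
           "hyt, ytr,oiu", "hyg,hyu,loi", "opu,yut,yut"),
  stringsAsFactors = FALSE
)

# Factor levels in order of first appearance keep the groups unsorted
groups <- factor(df$sol, levels = unique(df$sol))

# Paste all desc values of each group into one comma-separated string
combined <- tapply(df$desc, groups, paste, collapse = ",")

result <- data.frame(
  sol  = as.integer(names(combined)),
  desc = as.vector(combined),
  stringsAsFactors = FALSE
)
result
```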
I am trying to figure out how to get the time between consecutive events when events are stored as a column of dates in a dataframe.
sampledf=structure(list(cust = c(1L, 1L, 1L, 1L), date = structure(c(9862,
9879, 10075, 10207), class = "Date")), .Names = c("cust", "date"
), row.names = c(NA, -4L), class = "data.frame")
I can get an answer with
as.numeric(rev(rev(difftime(c(sampledf$date[-1],0),sampledf$date))[-1]))
# [1] 17 196 132
but it is really ugly. Among other things, I only know how to exclude the first item in a vector, but not the last so I have to rev() twice to drop the last value.
Is there a better way?
By the way, I will use ddply to do this to a larger set of data for each cust id, so the solution would need to work with ddply.
library(plyr)
ddply(sampledf,
c("cust"),
summarize,
daysBetween = as.numeric(rev(rev(difftime(c(date[-1],0),date))[-1]))
)
Thank you!
Are you looking for this?
as.numeric(diff(sampledf$date))
# [1] 17 196 132
To remove the last element, use head:
head(as.numeric(diff(sampledf$date)), -1)
# [1] 17 196
require(plyr)
ddply(sampledf, .(cust), summarise, daysBetween = as.numeric(diff(date)))
# cust daysBetween
# 1 1 17
# 2 1 196
# 3 1 132
You can just use diff.
as.numeric(diff(sampledf$date))
To leave off the last element, you can do:
vec[-length(vec)] # where `vec` is your vector
In this case I don't think you need to leave anything off though, because diff is already one element shorter:
test <- ddply(sampledf,
c("cust"),
summarize,
# use the per-group column `date`, not the full sampledf$date
daysBetween = as.numeric(diff(date))
)
test
# cust daysBetween
#1 1 17
#2 1 196
#3 1 132
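For comparison, the per-customer gaps can also be sketched in base R with split() and lapply(), without plyr. The two-customer data frame below is hypothetical and only serves to show that each customer's dates are differenced separately:

```r
# Hypothetical events for two customers
events <- data.frame(
  cust = c(1L, 1L, 1L, 2L, 2L),
  date = as.Date(c("1997-01-01", "1997-01-18", "1997-08-02",
                   "1998-03-01", "1998-03-11"))
)

# One vector of day gaps per customer
gaps <- lapply(split(events$date, events$cust),
               function(d) as.numeric(diff(d)))
gaps
```

Each list element is one element shorter than that customer's number of events, because diff() only reports gaps between consecutive dates.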
I am a novice R user. I have a data set formatted like this:
Date Temp Month
1-Jan-90 10.56 1
2-Jan-90 11.11 1
3-Jan-90 10.56 1
4-Jan-90 -1.67 1
5-Jan-90 0.56 1
6-Jan-90 10.56 1
7-Jan-90 12.78 1
8-Jan-90 -1.11 1
9-Jan-90 4.44 1
10-Jan-90 10.00 1
In R syntax:
datacl <- structure(list(Date = structure(1:10, .Label = c("1990/01/01",
"1990/01/02", "1990/01/03", "1990/01/04", "1990/01/05", "1990/01/06",
"1990/01/07", "1990/01/08", "1990/01/09", "1990/01/10"), class = "factor"),
Temp = c(10.56, 11.11, 10.56, -1.67, 0.56, 10.56, 12.78,
-1.11, 4.44, 10), Month = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L)), .Names = c("Date", "Temp", "Month"), class = "data.frame", row.names = c(NA,
-10L))
I would like to subset the data for a particular month, apply a change factor to the temperature, and then save the results. So far I have something like:
idx <- subset(datacl, Month == 1) # Index
results[idx[,2],1] = idx[,2]+change # change applied to only index values
but I keep getting an error like
Error in results[idx[, 2], 1] = idx[, 2] + change:
only 0's may be mixed with negative subscripts
Any help would be appreciated.
First, give the change factor a value:
change <- 1
Now, here is how to create an index:
# one approach to subsetting is to create a logical vector:
jan.idx <- datacl$Month == 1
# alternatively the which function returns numeric indices:
jan.idx2 <- which(datacl$Month == 1)
If you want just the subset of data from January,
jandata <- datacl[jan.idx,]
transformed.jandata <- transform(jandata, Temp = Temp + change)
To keep the entire data frame, but only add the change factor to Jan temps:
datacl$Temp[jan.idx] <- datacl$Temp[jan.idx] + change
First, note that subset does not produce an index, it produces a subset of your original dataframe containing all rows with Month == 1.
Then when you are doing idx[,2], you are selecting out the Temp column.
results[idx[,2],1] = idx[,2] + change
But then you are using these as an index into results, i.e. you're using them as row numbers. Row numbers can't be things like 10.56 or -1.11, hence your error. Also, you're selecting the first column of results which is Date and trying to add temperatures to it.
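The failure described above can be reproduced on a small vector. This is only an illustrative sketch with made-up values: non-integer subscripts are truncated toward zero, so a mix of positive and negative temperatures becomes a mix of positive and negative subscripts, which R rejects. tryCatch() is used so the snippet runs to completion and captures the message instead of stopping:

```r
temps <- c(10.56, 11.11, 10.56, -1.67, 0.56)

# Using temperatures as subscripts: 10.56 is treated as 10 and -1.67 as -1,
# and positive and negative subscripts cannot be combined
res <- tryCatch(
  temps[c(10.56, -1.67)],
  error = function(e) conditionMessage(e)
)
res

# A logical index selects rows by condition instead
is_negative <- temps < 0
temps[is_negative]
```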
There are a few ways you can do this.
You can create a logical index that is TRUE for a row with Month == 1 and FALSE otherwise like so:
idx <- datacl$Month == 1
Then you can use that index to select the rows in datacl you want to modify (this is what you were trying to do originally, I think):
datacl$Temp[idx] <- datacl$Temp[idx] + change # or 'results' instead of 'datacl'?
Note that datacl$Temp[idx] selects the Temp column of datacl and the idx rows.
You could also do
datacl[idx,'Temp']
or
datacl[idx,2] # as Temp is the second column.
If you only want results to be the subset where Month == 1, try:
results <- subset(datacl, Month == 1)
results$Temp <- results$Temp + change
This is because results only contains the rows you are interested in, so there's no need to do subsetting.
Personally, I would use ifelse() and leverage the syntactic beauty that is within() for a nice one-liner:
datacl <- within(datacl, Temp <- ifelse(Month == 1, Temp + change, Temp))
Well, I said one-liner, but you'd need to define change somewhere else too.