A novice R user here. I have a data set formatted like:
Date Temp Month
1-Jan-90 10.56 1
2-Jan-90 11.11 1
3-Jan-90 10.56 1
4-Jan-90 -1.67 1
5-Jan-90 0.56 1
6-Jan-90 10.56 1
7-Jan-90 12.78 1
8-Jan-90 -1.11 1
9-Jan-90 4.44 1
10-Jan-90 10.00 1
In R syntax:
datacl <- structure(list(Date = structure(1:10, .Label = c("1990/01/01",
"1990/01/02", "1990/01/03", "1990/01/04", "1990/01/05", "1990/01/06",
"1990/01/07", "1990/01/08", "1990/01/09", "1990/01/10"), class = "factor"),
Temp = c(10.56, 11.11, 10.56, -1.67, 0.56, 10.56, 12.78,
-1.11, 4.44, 10), Month = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L)), .Names = c("Date", "Temp", "Month"), class = "data.frame", row.names = c(NA,
-10L))
I would like to subset the data for a particular month, apply a change factor to the temp, and then save the results. So I have something like
idx <- subset(datacl, Month == 1) # Index
results[idx[,2],1] = idx[,2]+change # change applied to only index values
but I keep getting an error like
Error in results[idx[, 2], 1] = idx[, 2] + change:
only 0's may be mixed with negative subscripts
Any help would be appreciated.
First, give the change factor a value:
change <- 1
Now, here is how to create an index:
# one approach to subsetting is to create a logical vector:
jan.idx <- datacl$Month == 1
# alternatively the which function returns numeric indices:
jan.idx2 <- which(datacl$Month == 1)
If you want just the subset of data from January,
jandata <- datacl[jan.idx,]
transformed.jandata <- transform(jandata, Temp = Temp + change)
To keep the entire data frame, but only add the change factor to Jan temps:
datacl$Temp[jan.idx] <- datacl$Temp[jan.idx] + change
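Putting those steps together on a cut-down version of the sample data (with change set to 1 purely for illustration, since the question never gives its value):

```r
# Minimal stand-in for the sample data (shortened to two months)
datacl <- data.frame(
  Date  = c("1990/01/01", "1990/01/02", "1990/02/01", "1990/02/02"),
  Temp  = c(10.56, 11.11, -1.67, 0.56),
  Month = c(1L, 1L, 2L, 2L)
)
change <- 1  # assumed value; the question never states it

jan.idx <- datacl$Month == 1                        # logical index
datacl$Temp[jan.idx] <- datacl$Temp[jan.idx] + change

datacl$Temp  # 11.56 12.11 -1.67  0.56
```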
First, note that subset does not produce an index; it produces a subset of your original data frame containing all rows with Month == 1.
Then when you are doing idx[,2], you are selecting out the Temp column.
results[idx[,2],1] = idx[,2] + change
But then you are using these as an index into results, i.e. you're using them as row numbers. Row numbers can't be things like 10.56 or -1.11, hence your error. Also, you're selecting the first column of results which is Date and trying to add temperatures to it.
There are a few ways you can do this.
You can create a logical index that is TRUE for a row with Month == 1 and FALSE otherwise like so:
idx <- datacl$Month == 1
Then you can use that index to select the rows in datacl you want to modify (this is what you were trying to do originally, I think):
datacl$Temp[idx] <- datacl$Temp[idx] + change # or 'results' instead of 'datacl'?
Note that datacl$Temp[idx] selects the Temp column of datacl and the idx rows.
You could also do
datacl[idx,'Temp']
or
datacl[idx,2] # as Temp is the second column.
If you only want results to be the subset where Month == 1, try:
results <- subset(datacl, Month == 1)
results$Temp <- results$Temp + change
This works because results only contains the rows you are interested in, so there is no need for further subsetting.
Personally, I would use ifelse() and leverage the syntactic beauty that is within() for a nice one-liner: datacl <- within(datacl, Temp <- ifelse(Month == 1, Temp + change, Temp)). Well, I said one-liner, but you'd need to define change somewhere else too.
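Run on a cut-down frame (again with an assumed change of 1), that one-liner behaves like this:

```r
change <- 1  # assumed value for illustration
datacl <- data.frame(Temp = c(10.56, -1.67, 0.56), Month = c(1L, 2L, 1L))

# within() evaluates the assignment inside the data frame and returns a copy
datacl <- within(datacl, Temp <- ifelse(Month == 1, Temp + change, Temp))
datacl$Temp  # 11.56 -1.67  1.56
```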
I need to manipulate the raw data (csv) into a wide format so that I can analyze it in R or SPSS.
It looks something like this:
1,age,30
1,race,black
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99
Ideally it would look like:
ID,age,race,scale_total, etc
1, 30, black, 35
2, 20, white, 99
I added a header row to the raw data (ID, Question, Response) and tried the cast function, but I believe this aggregated the data instead of just reshaping it:
data_mod <- cast(raw.data2, ID~Question, value="Response")
Aggregation requires fun.aggregate: length used as default
You could use tidyr...
library(tidyr)
df<-read.csv(text="1,age,30
1,race,black
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99", header=FALSE, stringsAsFactors=FALSE)
df %>% spread(key=V2,value=V3)
V1 age race scale_total
1 1 30 black 35
2 2 20 white 99
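In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), which does the same reshape:

```r
library(tidyr)

df <- read.csv(text = "1,age,30
1,race,black
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99", header = FALSE, stringsAsFactors = FALSE)

# names_from picks the column whose values become the new column names,
# values_from picks the column that fills them
wide <- pivot_wider(df, names_from = V2, values_from = V3)
wide
```

Note that V3 is read as character here (it mixes "black" with "30"), so age and scale_total would still need converting to numeric afterwards.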
If the full data contains duplicate ID/Question rows, a sequence column needs to be created to handle them; otherwise dcast defaults to aggregating with length:
library(data.table)
dcast(setDT(df1), ID + rowid(Question) ~ Question, value.var = 'Response')
NOTE: The example data works as-is (giving the expected output) without the sequence column:
dcast(setDT(df1), ID ~ Question)
# ID age race scale_total
#1: 1 30 black 35
#2: 2 20 white 99
So the rowid() version above is for the case where the full dataset contains duplicate rows.
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), Question = c("age",
"race", "scale_total", "age", "race", "scale_total"), Response = c("30",
"black ", "35", "20", "white", "99")), class = "data.frame",
row.names = c(NA, -6L))
For SPSS:
data list list/ID (f5) Question Response (2a20).
begin data
1 "age" "30"
1 "race" "black"
1 "scale_total" "35"
2 "age" "20"
2 "race" "white"
2 "scale_total" "99"
end data.
casestovars /id=id /index=question.
Note that the resulting variables age and scale_total will be string variables - you'll have to turn them into numbers before further transformations:
alter type age scale_total (f8).
I am trying to filter the output from RNA-seq data analysis. I want to generate a list of genes that fit the specified criteria in at least one experimental condition (dataframe).
For example, the data is output as a .csv, so I read in the whole directory, as follows.
readList = list.files("~/Path/To/File/", pattern = "*.csv")
files = lapply(readList, read.csv, row.names = 1)
#row.names = 1 sets rownames as gene names
This reads in 3 .csv files, A, B and C. The data look like this
A = files[[1]]
B = files[[2]]
C = files[[3]]
head(A)
logFC logCPM LR PValue FDR
YER037W -1.943616 6.294092 34.30835 0.000000004703583 0.00002276064
YJL184W -1.771273 5.840774 31.97088 0.000000015650144 0.00003786552
YFR053C 1.990102 10.107793 30.55576 0.000000032440747 0.00005232692
YDR342C 2.096877 6.534761 28.08635 0.000000116021451 0.00014035695
YGL062W 1.649138 8.940714 23.32097 0.000001370968319 0.00132682314
YFR044C 1.992810 9.302504 22.91553 0.000001692786468 0.00132736130
I then try to filter all of these to generate a list of genes (rownames) where two conditions must be met in at least one dataset:
1. logFC > 1 or logFC < -1
2. FDR < 0.05
So I loop through the dataframes like so
genesKeep = ""
for (i in 1:length(files)) {
F = data.frame(files[i])
sigGenes = rownames(F[F$FDR<0.05 & abs(F$logFC>1), ])
genesKeep = append(genesKeep, values = sigGenes)
}
This gives me a list of genes; however, when I sanity-check these against the data, some of the listed genes do not pass these thresholds, whilst other genes that do pass them are not present in the list.
e.g.
df = cbind(A,B,C)
genesKeep = unique(genesKeep)
logicTest = rownames(df) %in% genesKeep
dfLogic = cbind(df, logicTest)
Whilst the majority of genes do in fact pass the criteria I set, I see discrepancies for a few genes. For example:
A.logFC A.FDR B.logFC B.FDR C.logFC C.FDR logicTest
YGR181W -0.8050325 0.1462688 -0.6834184 0.2162317 -1.1923744 0.04049870 FALSE
YOR185C 0.8321432 0.1462919 0.7401477 0.2191413 -0.9616989 0.04098177 TRUE
The first gene (YGR181W) passes the criteria in condition C, where logFC < -1 and FDR < 0.05. However, the gene is not reported in the genesKeep list.
Conversely, the second gene (YOR185C) does not pass these criteria in any condition, but the gene is present in the genesKeep list.
I'm unsure where I'm going wrong here, but if anyone has any ideas they would be much appreciated.
Thanks.
Using merge as suggested by akash87 solved the problem.
Turns out cbind was causing the rownames to not be assigned correctly.
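For reference, the merge-based fix looks roughly like this; the toy frames below are hypothetical stand-ins for A, B and C, with the second frame's rows deliberately in a different order to show why cbind misaligns and merge does not:

```r
# Hypothetical stand-ins for two of the read.csv results
A <- data.frame(logFC = c(-1.2, 0.5), FDR = c(0.01, 0.2),
                row.names = c("YGR181W", "YOR185C"))
B <- data.frame(logFC = c(0.3, -1.4), FDR = c(0.5, 0.04),
                row.names = c("YOR185C", "YGR181W"))  # different row order

# Promote rownames to a key column and suffix the measure columns
to_keyed <- function(d, suffix) {
  names(d) <- paste(names(d), suffix, sep = ".")
  d$gene <- rownames(d)
  d
}

# merge() joins on the shared 'gene' column, so row order no longer matters
df <- Reduce(function(x, y) merge(x, y, by = "gene"),
             list(to_keyed(A, "A"), to_keyed(B, "B")))
df
```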
I'm not exactly sure what your desired output is here, but it might be possible to simplify a bit and use the dplyr library to filter all your outputs at once, assuming the format of your data is consistent. Using some modified versions of your data as an example:
A <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C",
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"),
logFC = c(-1.943616, -1.771273, 0, 2.096877, 1.649138, 1.99281
), logCPM = c(6.294092, 5.840774, 10.107793, 6.534761, 8.940714,
9.302504), LR = c(34.30835, 31.97088, 30.55576, 28.08635,
23.32097, 22.91553), PValue = c(4.703583e-09, 1.5650144e-08,
3.2440747e-08, 1.16021451e-07, 1.370968319e-06, 1.692786468e-06
), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05, 0.00014035695,
0.00132682314, 0.06)), .Names = c("gene", "logFC", "logCPM",
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA,
-6L))
B <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C",
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"),
logFC = c(-0.4, -0.3, 0, 2.096877, 1.649138, 1.99281), logCPM = c(6.294092,
5.840774, 10.107793, 6.534761, 8.940714, 9.302504), LR = c(34.30835,
31.97088, 30.55576, 28.08635, 23.32097, 22.91553), PValue = c(4.703583e-09,
1.5650144e-08, 3.2440747e-08, 1.16021451e-07, 1.370968319e-06,
1.692786468e-06), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05,
0.00014035695, 0.1, 0.06)), .Names = c("gene", "logFC", "logCPM",
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA,
-6L))
Use rbind to create a single dataframe to work with:
AB<- rbind(A,B)
Then filter this whole thing based on your criteria. Note that duplicates can occur, so you can use distinct to only return unique genes that qualify:
library(dplyr)
filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
distinct(gene)
gene
1 YER037W
2 YJL184W
3 YDR342C
4 YGL062W
Or, to keep all the rows for those genes as well:
filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
distinct(gene, .keep_all = TRUE)
gene logFC logCPM LR PValue FDR
1 YER037W -1.943616 6.294092 34.30835 4.703583e-09 2.276064e-05
2 YJL184W -1.771273 5.840774 31.97088 1.565014e-08 3.786552e-05
3 YDR342C 2.096877 6.534761 28.08635 1.160215e-07 1.403570e-04
4 YGL062W 1.649138 8.940714 23.32097 1.370968e-06 1.326823e-03
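If you also need to know which input each qualifying row came from, dplyr's bind_rows() can tag the source via its .id argument (an optional extension, shown on hypothetical cut-down frames):

```r
library(dplyr)

# Hypothetical cut-down versions of two condition tables
A <- data.frame(gene = c("YER037W", "YFR044C"),
                logFC = c(-1.94, 1.99), FDR = c(2.3e-05, 0.06))
B <- data.frame(gene = c("YER037W", "YFR044C"),
                logFC = c(-0.40, 1.99), FDR = c(2.3e-05, 0.06))

# .id adds a column recording which list element each row came from
AB <- bind_rows(list(A = A, B = B), .id = "condition")
filter(AB, abs(logFC) > 1, FDR < 0.05)
```

Only YER037W from condition A survives the filter here: its B logFC is too small, and YFR044C fails the FDR cutoff in both.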
I am working with a dataset of hourly temperatures and I need to calculate "degree hours" above a heat threshold for each extreme event. I intend to run stats on the intensities (combined magnitude and duration) of each event to compare multiple sites over the same time period.
Example of data:
Temp
1 14.026
2 13.714
3 13.25
.....
21189 12.437
21190 12.558
21191 12.703
21192 12.896
Data after selecting only hours above the threshold of 18 degrees and then subtracting 18 to reveal degrees above 18:
Temp
5297 0.010
5468 0.010
5469 0.343
5470 0.081
5866 0.010
5868 0.319
5869 0.652
After this step I need help to sum consecutive hours during which the reading exceeded my specified threshold.
What I am hoping to produce out of above sample:
Temp
1 0.010
2 0.434
3 0.010
4 0.971
I've debated manipulating these data within a time series or by adding additional columns, but I do not want multiple rows for each warming event. I would immensely appreciate any advice.
This is an alternative solution in base R.
You have some data that walks around, and you want to sum up the points above a cutoff. For example:
set.seed(99999)
x <- cumsum(rnorm(30))
plot(x, type='b')
abline(h=2, lty='dashed')
which plots the walk with the cutoff shown as a dashed horizontal line.
First, we want to split the data into groups based on when they cross the cutoff. We can use run-length encoding on the indicator to get a compressed version:
x.rle <- rle(x > 2)
which has the value:
Run Length Encoding
lengths: int [1:8] 5 2 3 1 9 4 5 1
values : logi [1:8] FALSE TRUE FALSE TRUE FALSE TRUE ...
The first group is the first 5 points where x > 2 is FALSE; the second group is the two following points, and so on.
We can create a group id by replacing the values in the rle object, and then back transforming:
x.rle$values <- seq_along(x.rle$values)
group <- inverse.rle(x.rle)
Finally, we aggregate by group, keeping only the data above the cut off:
aggregate(x~group, subset = x > 2, FUN=sum)
Which produces:
group x
1 2 5.113291213
2 4 2.124118005
3 6 11.775435706
4 8 2.175868979
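The steps above can be wrapped into a small reusable helper (base R only; the function name is mine):

```r
# Sum the values of each consecutive run of points above a cutoff
sum_above <- function(x, cutoff) {
  r <- rle(x > cutoff)
  r$values <- seq_along(r$values)        # one id per run
  group <- inverse.rle(r)                # expand ids back to full length
  aggregate(x ~ group, data = data.frame(x, group),
            subset = x > cutoff, FUN = sum)
}

sum_above(c(1, 3, 4, 1, 5), 2)  # runs sum to 3 + 4 = 7, and 5
```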
I'd use data.table for this, although there are certainly other ways.
library( data.table )
setDT( df )
temp.threshold <- 18
First make a column showing the previous value from each one in your data. This will help to find the point at which the temperature rose above your threshold value.
df[ , lag := shift( Temp, fill = 0, type = "lag" ) ]
Now use that previous value column to compare with the Temp column. Mark every point at which the temperature rose above the threshold with a 1, and all other points as 0.
df[ , group := 0L
][ Temp > temp.threshold & lag <= temp.threshold, group := 1L ]
Now we can get cumsum of that new column, which will give each sequence after the temperature rose above the threshold its own group ID.
df[ , group := cumsum( group ) ]
Now we can get rid of every value not above the threshold.
df <- df[ Temp > temp.threshold, ]
And summarise what's left by finding the "degree hours" of each "group".
bygroup <- df[ , sum( Temp - temp.threshold ), by = group ]
I modified your input data a little to provide a couple of test events where the data rose above threshold:
structure(list(num = c(1L, 2L, 3L, 4L, 5L, 21189L, 21190L, 21191L,
21192L, 21193L, 21194L), Temp = c(14.026, 13.714, 13.25, 20,
19, 12.437, 12.558, 12.703, 12.896, 21, 21)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -11L), .Names = c("num",
"Temp"), spec = structure(list(cols = structure(list(num = structure(list(), class = c("collector_integer",
"collector")), Temp = structure(list(), class = c("collector_double",
"collector"))), .Names = c("num", "Temp")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
With that data, here's the output of the code above (note $V1 is in "degree hours"):
> bygroup
group V1
1: 1 3
2: 2 6
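As an aside, data.table's rleid() can replace the lag/cumsum bookkeeping in one step (sketched on a small hypothetical series, not the original df):

```r
library(data.table)

df <- data.table(Temp = c(14, 13, 20, 19, 12, 21, 21))
temp.threshold <- 18

# rleid() starts a new group id every time the above/below state flips
df[, group := rleid(Temp > temp.threshold)]
bygroup <- df[Temp > temp.threshold,
              .(degree_hours = sum(Temp - temp.threshold)),
              by = group]
bygroup$degree_hours  # 3 6
```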
As part of a project, I am currently using R to analyze some data. I am stuck retrieving a few values from an existing dataset which I have imported from a csv file.
For my analysis, I want to create another column that is the current value of x minus its previous value, except that the first value for every unique i should stay the same as it currently is. I am new to R and have been trying various approaches for some time now, but I am still not able to figure out how to do this. Any suggestions on how to achieve this would be appreciated.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=T), i=rep(1:2, e=5))
You can use the diff() function. If you want to add a new column to your existing data frame, note that diff() returns a vector one element shorter than your data frame, so in your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That should insert an NA as the first entry in your new column, and the remaining values will be the differences between sequential values in your x column.
UPDATE:
You can create a simple loop that subsets on every unique value of i and then calculates the differences between your x values:
# initialize a new dataframe
newdf = NULL
values = unique(MyData$i)
for(i in 1:length(values)){
  data1 = MyData[MyData$i == values[i],]
  data1$newX = c(NA,diff(data1$x))
  newdf = rbind(newdf,data1)
}
# and then if you want to overwrite your original dataframe with newdf
MyData = newdf
# remove some variables
rm(data1,newdf,values)
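The same per-group difference can be had without an explicit loop via base R's ave(), which keeps the first value of each group as-is, matching what the question asked for (the two-group i column below is a hypothetical variant of the posted data):

```r
MyData <- data.frame(t = 1:10,
                     x = c(34450L, 34469L, 34470L, 34483L, 34488L,
                           34512L, 34530L, 34553L, 34575L, 34589L),
                     i = rep(1:2, each = 5))

# ave() applies FUN within each level of i and preserves row order
MyData$x_diff <- ave(MyData$x, MyData$i,
                     FUN = function(v) c(v[1], diff(v)))
MyData$x_diff  # 34450 19 1 13 5 34512 18 23 22 14
```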
Let's say I have a very large data.frame containing scores per column.
For example:
MA0001.1 AGL3 MA0003.1 TFAP2A MA0004.1 Arnt MA0005.1 AG MA0006.1 Arnt::Ahr
7.789524e-09 0.4012127249 3.771518e-03 1.892011e-06 0.002733200
5.032498e-07 0.0001873801 9.947449e-05 3.284222e-05 0.001367041
1.194487e-06 0.0009357406 6.943634e-05 1.589373e-05 0.002551519
4.833494e-06 0.0150703600 1.003488e-04 1.197928e-03 0.001431416
6.865040e-05 0.0000732607 3.857193e-04 5.388744e-03 0.001363706
R data.frame:
testfr<-structure(list(`MA0001.1 AGL3` = c(7.78952366977488e-09, 5.03249791215203e-07,
1.19448739380034e-06, 4.83349413748598e-06, 6.86504034402563e-05
), `MA0003.1 TFAP2A` = c(0.401212724871542, 0.000187380067026448,
0.000935740631438077, 0.0150703600158589, 7.32607018758816e-05
), `MA0004.1 Arnt` = c(0.00377151826447817, 9.94744903768433e-05,
6.94363387424972e-05, 0.000100348764966112, 0.00038571926458373
), `MA0005.1 AG` = c(1.89201084302835e-06, 3.2842217133538e-05,
1.58937284554136e-05, 0.00119792816070882, 0.00538874414923338
), `MA0006.1 Arnt::Ahr` = c(0.00273319966783363, 0.00136704060025893,
0.00255151921946167, 0.00143141576426544, 0.00136370552325235
)), .Names = c("MA0001.1 AGL3", "MA0003.1 TFAP2A", "MA0004.1 Arnt",
"MA0005.1 AG", "MA0006.1 Arnt::Ahr"), class = "data.frame", row.names = c(4L,
2L, 5L, 1L, 3L))
Now I want to find the column with the highest values in it and place that column first, so that the values of each column stay under the same column name and whole columns move by rank.
I tried the following:
ranked<-unlist(lapply(testfr,rank))
testranked<-testfr[ranked, ]
This produces a data frame of 2259 obs * 459 vars, while the original was 5 * 459.
Note that testfr is a data.frame derived from a function which scores sequences against a list of matrices, and returns the scores in a data.frame where the rows are the sequences and the columns are the matrices.
I know I'm doing something wrong with the indexing or unlisting, but I don't have a clue how to fix it. Any help is appreciated.
How about this?
> testfr[rev(order(sapply(testfr, max, na.rm = TRUE)))]
Break down:
sapply(testfr, max, na.rm = TRUE) # get max of each column (after removing NAs)
order(.) # get the order of these values, increasing
rev(.) # reverse the order so that the highest-value column comes first
testfr[.] # get the columns back in this order
I would use apply for readability:
testfr[order(apply(testfr, 2, max, na.rm = TRUE), decreasing = TRUE)]
This applies max over the second margin (the columns here), then sorts the columns in decreasing order of their maxima.
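A quick sanity check of that ordering on a toy frame, where column b holds the largest value:

```r
d <- data.frame(a = c(1, 5), b = c(9, 2), c = c(3, 4))

# column maxima are a = 5, b = 9, c = 4, so b should come first
d2 <- d[order(apply(d, 2, max, na.rm = TRUE), decreasing = TRUE)]
names(d2)  # "b" "a" "c"
```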