How to process multi-column data in a data.frame with plyr

I am trying to process DSC (differential scanning calorimetry) data with R, but I have run into some trouble. In my lab this has always been done tediously in Origin or QtiPlot, and I wonder whether it can be done in batch instead. My attempt did not go well. For example, maybe I have used the wrong column names in my data.frame; the code
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
cannot reach my data.
Below is the full description of what I want to do. Thank you in advance!
The DSC data look like this (I store the CSV file on my Google Drive: Link):
T1 0.5min T2 1min
40.59 -0.2904 40.59 -0.2545
40.81 -0.281 40.81 -0.2455
41.04 -0.2747 41.04 -0.2389
41.29 -0.2728 41.29 -0.2361
41.54 -0.2553 41.54 -0.2239
41.8 -0.07 41.8 -0.0732
42.06 0.1687 42.06 0.1414
42.32 0.3194 42.32 0.2817
42.58 0.3814 42.58 0.3421
42.84 0.3863 42.84 0.3493
43.1 0.3665 43.11 0.3322
43.37 0.3438 43.37 0.3109
43.64 0.3265 43.64 0.2937
43.9 0.3151 43.9 0.2819
44.17 0.3072 44.17 0.2735
44.43 0.2995 44.43 0.2656
44.7 0.2899 44.7 0.2563
44.96 0.2779 44.96 0.245
In fact I have merged the data into a data.frame and hope I can adjust it and process it further.
The command is:
dat<-read.csv("Book1.csv",header=F)
colnames(dat)<-c('T1','0.5min','T2','1min','T3','2min','T4','4min','T5','8min','T6','10min',
'T7','20min','T8','ascast1','T9','ascast2','T10','ascast3','T11','ascast4',
'T12','ascast5'
)
So dat is a data.frame with 1163 obs. of 24 variables.
T1, T2, T3, ..., T12 are the temperatures at which the samples were measured by DSC. They cover the same interval but differ slightly because of the instability of the machine.
The column next to each of T1~T12 is the heat flow recorded by the machine for a given heat-treatment duration; ascast1~ascast5 are samples with no treatment, used to check the accuracy of the machine.
Now I need to do the following:
T1~T12 are in degrees Celsius. I need to convert them to kelvin, which means adding 273.15 to every value.
Two temperatures are chosen to compare the results: Ts=180.25 and Te=240.45 (both in degrees Celsius; I have checked them in QtiPlot). To be clear, I list the rows at these two temperatures for the first 8 columns.
T1 0.5min T2 1min T3 2min T4 4min
180.25 -0.01710000 180.25 -0.01780000 180.25 -0.02120000 180.25 -0.02020000
. . . .
. . . .
240.45 0.05700000 240.45 0.04500000 240.45 0.05780000 240.45 0.05580000
The heat flow at Ts should be the same for all curves, and can be set to 0 for convenience. So for each heat flow column (0.5min, 1min, 2min, 4min, 8min, 10min, 20min and ascast1~ascast5), every value should have that column's heat flow at Ts subtracted from it.
The heat flow at Te should also be adjusted so that all curves have the same value at Te. The procedure is: (1) calculate the mean of the 12 heat flow values at Te; call it Hmean. Hmean is the value that every heat flow curve should take at Te. (2) For the data in column 0.5min, which I denote col("0.5min"), the linear transformation is:
col("0.5min")-[([0.05700000-(-0.01710000)]-Hmean)/(Te-Ts)]*(col(T1)-Ts)
Actually, the subtraction [0.05700000-(-0.01710000)] has already been done in the previous step (setting the heat flow at Ts to 0), but I write it out for your reference. The same formula is applied to each pair of temperature and heat flow columns, like (T1, 0.5min), (T2, 1min), (T3, 2min), ...; there are 12 pairs in all.
Then the 12 pairs can be drawn on the same plot over the interval 180~240 (also in degrees Celsius) to magnify the differences between the DSC scans.
I have been stuck on this problem for 2 days, so I have turned to Stack Overflow for help.
Thanks!

I am assuming that your actual question is at the beginning, where you got the following error,
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
since I could not find a question in the rest of the post; the steps just read like a step-by-step experimental procedure.
The error occurs because the column name starts with a digit, so it is not a syntactically valid R name. To reference such a column with $, wrap the name in backticks (`):
>dataF <- data.frame("0.5min"=1:10,"T2"=11:20,check.names = F)
> dataF$`0.5min`
[1] 1 2 3 4 5 6 7 8 9 10
Based on the comments adding more information:
You can add a constant to alternate columns in the following manner,
dataF <- data.frame(matrix(1:100,10,10))
const <- 237
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 11 21 31 41 51 61 71 81 91
2 2 12 22 32 42 52 62 72 82 92
3 3 13 23 33 43 53 63 73 83 93
4 4 14 24 34 44 54 64 74 84 94
5 5 15 25 35 45 55 65 75 85 95
6 6 16 26 36 46 56 66 76 86 96
7 7 17 27 37 47 57 67 77 87 97
8 8 18 28 38 48 58 68 78 88 98
9 9 19 29 39 49 59 69 79 89 99
10 10 20 30 40 50 60 70 80 90 100
dataF[,seq(1,ncol(dataF),by = 2)] <- dataF[,seq(1,ncol(dataF),by = 2)] + const
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 238 11 258 31 278 51 298 71 318 91
2 239 12 259 32 279 52 299 72 319 92
3 240 13 260 33 280 53 300 73 320 93
4 241 14 261 34 281 54 301 74 321 94
5 242 15 262 35 282 55 302 75 322 95
6 243 16 263 36 283 56 303 76 323 96
7 244 17 264 37 284 57 304 77 324 97
8 245 18 265 38 285 58 305 78 325 98
9 246 19 266 39 286 59 306 79 326 99
10 247 20 267 40 287 60 307 80 327 100
To generalize, the columns of a data frame can be referenced with a vector of column numbers or column names, and most operations in R are vectorized. You can select columns by name or by number, depending on the pattern you are looking for.
For example, if I rename my first two columns and want to access just those, I do this:
colnames(dataF)[c(1,2)] <- c("Y1","Y2")
#Reference all column names with "Y" in it. You can do any operation you want on this.
dataF[,grep("Y",colnames(dataF))]
Y1 Y2
1 238 11
2 239 12
3 240 13
4 241 14
5 242 15
6 243 16
7 244 17
8 245 18
9 246 19
10 247 20
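Applied to the question's layout, the same column-indexing idea could look like the sketch below. It covers only the first two steps (shifting each heat flow curve to 0 at Ts, then converting the temperature columns to kelvin) and uses the first three rows of the posted data. The value of Ts here is a stand-in chosen so that it falls inside those three rows; the question itself uses 180.25. The assumption that each heat flow column sits immediately to the right of its temperature column comes from the posted colnames() call.

```r
# First three rows of the question's data; check.names = FALSE keeps "0.5min" as-is
dat <- data.frame("T1" = c(40.59, 40.81, 41.04),
                  "0.5min" = c(-0.2904, -0.2810, -0.2747),
                  "T2" = c(40.59, 40.81, 41.04),
                  "1min" = c(-0.2545, -0.2455, -0.2389),
                  check.names = FALSE)

temp_cols <- grep("^T[0-9]+$", names(dat))  # positions of T1, T2, ...
flow_cols <- temp_cols + 1                  # assumed: heat flow follows its temperature

Ts <- 40.81  # stand-in for the demo rows (in Celsius); the question uses 180.25

# Shift each heat flow curve so that it is 0 at the row closest to Ts
for (k in seq_along(temp_cols)) {
  i_ts <- which.min(abs(dat[[temp_cols[k]]] - Ts))
  dat[[flow_cols[k]]] <- dat[[flow_cols[k]]] - dat[[flow_cols[k]]][i_ts]
}

# Convert all temperature columns from Celsius to kelvin in one step
dat[temp_cols] <- dat[temp_cols] + 273.15
```

Because the temperature columns differ slightly between scans, the sketch picks the row nearest to Ts rather than testing for exact equality.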

Related

Am I able to get a specific P-value to see where the significance lies?

These are the survey results. I have tried pairwise testing (pairwise.wilcox.test) on the results collected in Spring and Autumn for these sites, but I can't get a specific p-value showing which site has the most influence.
This is the error message I keep getting. My dataset is unbalanced, i.e. some of the sites were not surveyed in Spring, which I think may be the issue.
Error in wilcox.test.default(xi, xj, paired = paired, ...) :
'x' must be numeric
So I'm not sure whether I have laid out the table wrongly for seeing how much site influences the results between Spring and Autumn.
Site Autumn Spring
Stokes Bay 25 6
Stokes Bay 54 6
Stokes Bay 31 0
Gosport Wall 213 16
Gosport Wall 24 19
Gosport Wall 54 60
No Mans Land 76 25
No Mans Land 66 68
No Mans Land 229 103
Osbourne 1 77
Osbourne 1 92
Osbourne 1 92
Osbourne 2 114 33
Osbourne 2 217 114
Osbourne 2 117 64
Osbourne 3 204 131
Osbourne 3 165 85
Osbourne 3 150 81
Osbourne 4 124 15
Osbourne 4 79 64
Osbourne 4 176 65
Ryde Roads 217 165
Ryde Roads 182 63
Ryde Roads 112 53
Ryde Sands 386 44
Ryde Sands 375 25
Ryde Sands 147 45
Spit Bank 223 23
Spit Bank 78 29
Spit Bank 60 15
St Helen's 1 247 11
St Helen's 1 126 36
St Helen's 1 107 20
St Helen's 2 108 115
St Helen's 2 223 25
St Helen's 2 126 30
Sturbridge 58 43
Sturbridge 107 34
Sturbridge 156 0
Osbourne Deep 1 76 59
Osbourne Deep 1 64 52
Osbourne Deep 1 77 30
Osbourne Deep 2 153 60
Osbourne Deep 2 106 88
Osbourne Deep 2 74 35
Sturbridge Shoal 169 45
Sturbridge Shoal 19 84
Sturbridge Shoal 81 44
Mother's Bank 208
Mother's Bank 119
Mother's Bank 153
Ryde Middle 16
Ryde Middle 36
Ryde Middle 36
Stanswood 14 132
Stanswood 47 87
Stanswood 14 88
This is what I've done so far:
MWU <- read.csv(file.choose(), header = T)
#attach file to workspace
attach(MWU)
#Read column names of the data
colnames(MWU) # Site, Autumn, Spring
MWU.1 <- MWU[c(1,2,3)] #It included blank columns in the df
kruskal.test(MWU.1$Autumn ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Autumn by MWU.1$Site
#Kruskal-Wallis chi-squared = 36.706, df = 24, p-value = 0.0468
kruskal.test(MWU.1$Spring ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Spring by MWU.1$Site
#Kruskal-Wallis chi-squared = 35.134, df = 21, p-value = 0.02729
wilcox.test(MWU.1$Autumn, MWU.1$Spring, paired = T)
#Wilcoxon signed rank exact test
#data: MWU.1$Autumn and MWU.1$Spring
#V = 1066, p-value = 8.127e-08
#alternative hypothesis: true location shift is not equal to 0
#Tried this version too to see if it would give a summary of where the influence is.
pairwise.wilcox.test(MWU.1$Spring, MWU.1$Autumn)
#Error in wilcox.test.default(xi, xj, paired = paired, ...) : not enough (non-missing) 'x' observations
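One possible reading of the problem: pairwise.wilcox.test() expects the response vector first and the grouping factor second, so passing Autumn as the second argument treats each distinct count as its own group. A minimal sketch of a per-site comparison, using the Autumn counts of three sites copied from the table above (the data frame name surv is made up for the illustration):

```r
# Autumn counts for three sites, taken from the table in the question
surv <- data.frame(
  Site = rep(c("Stokes Bay", "No Mans Land", "Ryde Sands"), each = 3),
  Autumn = c(25, 54, 31, 76, 66, 229, 386, 375, 147)
)

# Response first, grouping factor second; Holm correction for multiple comparisons
res <- pairwise.wilcox.test(surv$Autumn, surv$Site, p.adjust.method = "holm")
res$p.value  # lower-triangular matrix of adjusted p-values, one per pair of sites
```

Note that with only three observations per site, the smallest two-sided exact p-value a single Wilcoxon rank sum comparison can reach is 0.1, so no pair can come out significant at the 0.05 level, whatever the counts are.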

Script out of bounds in R

I am using code based on DESeq2. One of my goals is to plot a heatmap of the data.
heatmap.data <- counts(dds)[topGenes,]
The error I am getting is
Error in counts(dds)[topGenes, ]: subscript out of bounds
The first few lines of my counts(dds) output look like this.
99h1 99h2 99h3 99h4 wth1 wth2
ENSDARG00000000002 243 196 187 117 91 96
ENSDARG00000000018 42 55 53 32 48 48
ENSDARG00000000019 91 91 108 64 95 94
ENSDARG00000000068 3 10 10 10 30 21
ENSDARG00000000069 55 47 43 53 51 30
ENSDARG00000000086 46 26 36 18 37 29
ENSDARG00000000103 301 289 289 199 347 386
ENSDARG00000000151 18 19 17 14 22 19
ENSDARG00000000161 16 17 9 19 10 20
ENSDARG00000000175 10 9 10 6 16 12
ENSDARG00000000183 12 8 15 11 8 9
ENSDARG00000000189 16 17 13 10 13 21
ENSDARG00000000212 227 208 259 234 78 69
ENSDARG00000000229 68 72 95 44 71 64
ENSDARG00000000241 71 92 67 76 88 74
ENSDARG00000000324 11 9 6 2 8 9
ENSDARG00000000370 12 5 7 8 0 5
ENSDARG00000000394 390 356 339 283 313 286
ENSDARG00000000423 0 0 2 2 7 1
ENSDARG00000000442 1 1 0 0 1 1
ENSDARG00000000472 16 8 3 5 7 8
ENSDARG00000000476 2 1 2 4 6 3
ENSDARG00000000489 221 203 169 144 84 114
ENSDARG00000000503 133 118 139 89 91 112
ENSDARG00000000529 31 25 17 26 15 24
ENSDARG00000000540 25 17 17 10 28 19
ENSDARG00000000542 15 9 9 6 15 12
How do I ensure that all the elements of topGenes are present in it?
When I look at the top 20 genes in the dataset, I get a list like this, with the column head V1:
[1] "6339" "12416" "1241" "3025" "12791" "846" "15090"
[8] "6529" "14564" "4863" "12777" "1122" "7454" "13716"
[15] "5790" "3328" "1231" "13734" "2797" "9072"
I have used both
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = TRUE)
and
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = FALSE)
to see whether the out of bounds error goes away, but neither helped. I guess the V1 head is causing the issue.
topGenes was generated using the code snippet below.
resordered <- res[order(res$padj),]
#Reorder gene list by increasing pAdj
resordered <- as.data.frame(res[order(res$padj),])
#Filter for genes that are differentially expressed with an FDR < 0.01
ii <- which(res$padj < 0.01)
length(ii)
# Use the rownames() function to get the top 20 differentially expressed genes from our results table
topGenes <- rownames(resordered[1:20,])
topGenes
# Get the counts from the DESeqDataSet using the counts() function
heatmap.data <- counts(dds)[topGenes,]

Perhaps this will do what you want?
counts_dds <- counts(dds)
topgenes <- c("ENSDARG00000000002", "ENSDARG00000000489", "ENSDARG00000000503",
"ENSDARG00000000540", "ENSDARG00000000529", "ENSDARG00000000542")
heatmap.data <- counts_dds[rownames(counts_dds) %in% topgenes,]
If you provide more information it will be easier to advise you on how to fix your problem.
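A "subscript out of bounds" error on a character index generally means that at least one requested name is not among the row names. The sketch below shows one way to check which IDs are missing before subsetting. The matrix is a toy stand-in for counts(dds) built from three rows of the output posted in the question, and the topGenes vector mixes one real gene ID with "6339", one of the values from the question's list, purely to illustrate the check.

```r
# Toy stand-in for counts(dds): three rows copied from the question
counts_dds <- matrix(
  c(243, 196, 187, 117, 91, 96,
    42, 55, 53, 32, 48, 48,
    91, 91, 108, 64, 95, 94),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSDARG00000000002", "ENSDARG00000000018", "ENSDARG00000000019"),
    c("99h1", "99h2", "99h3", "99h4", "wth1", "wth2"))
)

# Illustrative ID vector: one valid gene ID and one value that is not a row name
topGenes <- c("ENSDARG00000000002", "6339")

missing_ids <- setdiff(topGenes, rownames(counts_dds))    # IDs that would trigger the error
present_ids <- intersect(topGenes, rownames(counts_dds))  # IDs that are safe to use

# drop = FALSE keeps the result a matrix even when only one row matches
heatmap.data <- counts_dds[present_ids, , drop = FALSE]
```

If topGenes comes from read.table(), it is a data frame rather than a character vector, so the column itself (for example topGenes$V1, converted with as.character()) would have to be passed to the subset.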

correlation between different matrices R

I'm trying to compute correlations (with p-values) between two different matrices (operational taxonomic units versus environmental parameters) in R.
The first table is this
biotic1 biotic2
T1 1.540184 3.080025
T2 1.354927 5.012977
T3 1.449712 4.715981
T4 1.146659 2.442083
X1 1.705184 3.881878
X2 1.182721 3.014836
X3 1.536956 2.636719
X4 1.808025 4.434525
A1 1.132737 2.135737
A2 1.506048 3.114281
A3 1.285308 4.363828
A4 3.008994 7.290423
and the second table
OTU1 OTU2 OTU3 OTU4 OTU5 OTU6 OTU7 OTU8
T1 109 80 175 14 71 46 61 39
T2 102 48 26 8 23 5 35 10
T3 26 19 61 3 68 13 10 29
T4 143 56 9 11 16 13 49 24
X1 70 36 20 15 39 9 26 12
X2 39 33 12 32 15 2 11 3
X3 43 17 2 14 8 2 7 2
X4 160 60 8 26 25 7 9 15
A1 90 73 41 15 22 23 33 7
A2 344 109 18 28 22 13 93 16
A3 65 16 15 9 5 10 18 6
A4 141 140 6 86 18 3 43 4
I have already tried cor() and corr.test(), but they only seem to correlate values from the first table.
Any suggestions?
Thank you very much
F

It's not clear to me what result you are expecting. However, if you want to perform a single simple correlation test over all values, you need your matrices in vector format. You can try something like:
cor(c(as.matrix(your_matrix1)), c(as.matrix(your_matrix2)))
or
cor.test(c(as.matrix(your_matrix1)), c(as.matrix(your_matrix2)))
and see if one of these options meets your expectations.
However, it makes more sense to me to explore your datasets with a canonical correlation analysis. In base R you can use:
cancor(matrix1, matrix2)
You can also use packages that provide tools for interpreting the results (e.g. library(CCA)).
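If the goal is instead one correlation per (environmental parameter, OTU) pair, base R's cor() accepts two matrices with the same number of rows and returns the cross-correlation matrix of their columns. cor.test() works on one pair of vectors at a time, so the p-values need a loop. A sketch using both biotic columns and the first two OTU columns from the question (the object names biotic, otu, r and p are made up for the illustration):

```r
# Columns copied from the two tables in the question (rows T1..A4 in order)
biotic <- cbind(
  biotic1 = c(1.540184, 1.354927, 1.449712, 1.146659, 1.705184, 1.182721,
              1.536956, 1.808025, 1.132737, 1.506048, 1.285308, 3.008994),
  biotic2 = c(3.080025, 5.012977, 4.715981, 2.442083, 3.881878, 3.014836,
              2.636719, 4.434525, 2.135737, 3.114281, 4.363828, 7.290423)
)
otu <- cbind(
  OTU1 = c(109, 102, 26, 143, 70, 39, 43, 160, 90, 344, 65, 141),
  OTU2 = c(80, 48, 19, 56, 36, 33, 17, 60, 73, 109, 16, 140)
)

# Pearson correlation of every biotic column with every OTU column
r <- cor(biotic, otu)

# Matching matrix of p-values, one cor.test() per pair
p <- sapply(colnames(otu), function(o)
  sapply(colnames(biotic), function(b)
    cor.test(biotic[, b], otu[, o])$p.value))
```

With many OTUs this produces many tests, so some multiple-comparison adjustment (for example p.adjust()) is probably worth considering.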

How to define range of values of a time series?

First of all, sorry for any mistakes in my post; I'm new to this site.
I'm getting started with R and I'm trying to do some analysis with time series data.
I have a time series at hand and have already loaded it into R.
I can also plot this time series and add labels to the axes and so on. So far so good.
My problem: when I plot the time series, R sets the range of values on the y-axis to approximately [0, 170].
This is strange, since the time series contains the daily EUR/USD exchange rates for this year. That means the values are in a range of about 1.05 to 1.2.
The relative values are correct.
If the plot shows a maximum around day 40, the corresponding value in the data set appears to be a maximum.
But it is around 1.4 and not 170.
I hope my problem is understandable.
I would like to have the y-axis on a scale from 1 to 1.2, for example.
Setting ylim=c(1, 1.2) rescales the axis to that range but not the values.
It just ignores them.
Does anyone know how to adjust that?
I'd really appreciate it.
Thank you very much in advance.
Thanks a lot for the input so far.
The "critical code" is the following:
> FRB <- read.csv("FRB_H10.csv", header=TRUE, sep=",")
> attach(FRB)
> str(FRB)
'data.frame': 212 obs. of 2 variables:
$ Date: Factor w/ 212 levels "2015-01-01","2015-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Rate: Factor w/ 180 levels "1.0524","1.0575",..: 180 179 177 178 174 173 175 176 171 172 ...
> plot.ts(Rate)
The result of this last plot is the one shown above.
Changing the variable to numeric yields this:
> as.numeric(Rate)
[1] 180 179 177 178 174 173 175 176 171 172 170 166 180 167 169 160 123 128 150 140 132 128 138 165
[25] 161 163 136 134 134 129 159 158 180 156 140 155 151 142 131 148 104 100 96 104 65 53 27 24
[49] 13 3 8 1 2 7 10 9 21 42 36 50 39 33 23 15 19 29 51 54 26 23 11 6
[73] 4 12 5 16 20 18 17 14 22 30 34 49 92 89 98 83 92 141 125 110 81 109 151 149
[97] 162 143 85 69 77 61 180 30 32 38 52 37 78 127 120 73 105 126 131 106 122 119 107 112
[121] 157 137 152 96 93 99 87 94 86 70 71 180 67 43 66 58 84 57 55 47 35 25 26 41
[145] 31 48 48 75 63 59 38 60 46 44 28 40 45 52 62 101 82 74 68 60 64 102 144 168
[169] 159 154 108 91 98 118 111 72 76 180 95 90 117 139 131 116 130 133 145 103 79 88 115 97
[193] 106 113 89 102 121 102 119 114 124 148 180 153 164 161 147 135 146 141 80 56
So, it remains unchanged. This is very strange. The data excerpt shows that "Rate" takes on values between 1.1 and 1.5 approximately, so really not the values that are shown above. :/
The data set can be found under this link:
https://www.dropbox.com/s/ndxstdl1aae5glt/FRB_H10.csv?dl=0
It should be alright. I got it from the database of the Federal Reserve System, so quite a decent source.
(I had to remove the link to the data excerpt because my reputation only allows 2 links per post. But the entire data set should be even better, I guess.)

@BlankUsername
Thanks very much for the link. I got it working now using this code:
FRB <- read.csv("FRB_H10.csv", header=TRUE, sep=",")
> attach(FRB)
> as.numeric(paste(Rate))
[1] NA 1.2015 1.1918 1.1936 1.1820 1.1811 1.1830 1.1832 1.1779 1.1806 1.1598 1.1517 NA
[14] 1.1559 1.1584 1.1414 1.1279 1.1290 1.1370 1.1342 1.1308 1.1290 1.1337 1.1462 1.1418 1.1432
[27] 1.1330 1.1316 1.1316 1.1300 1.1410 1.1408 NA 1.1395 1.1342 1.1392 1.1372 1.1346 1.1307
[40] 1.1363 1.1212 1.1197 1.1190 1.1212 1.1070 1.1006 1.0855 1.0846 1.0707 1.0576 1.0615 1.0524
[53] 1.0575 1.0605 1.0643 1.0621 1.0792 1.0928 1.0908 1.0986 1.0919 1.0891 1.0818 1.0741 1.0768
[66] 1.0874 1.0990 1.1008 1.0850 1.0818 1.0671 1.0598 1.0582 1.0672 1.0596 1.0742 1.0780 1.0763
[79] 1.0758 1.0729 1.0803 1.0876 1.0892 1.0979 1.1174 1.1162 1.1194 1.1145 1.1174 1.1345 1.1283
[92] 1.1241 1.1142 1.1240 1.1372 1.1368 1.1428 1.1354 1.1151 1.1079 1.1126 1.1033 NA 1.0876
[105] 1.0888 1.0914 1.0994 1.0913 1.1130 1.1285 1.1271 1.1108 1.1232 1.1284 1.1307 1.1236 1.1278
[118] 1.1266 1.1238 1.1244 1.1404 1.1335 1.1378 1.1190 1.1178 1.1196 1.1156 1.1180 1.1154 1.1084
[131] 1.1090 NA 1.1076 1.0952 1.1072 1.1025 1.1150 1.1020 1.1015 1.0965 1.0898 1.0848 1.0850
[144] 1.0927 1.0884 1.0976 1.0976 1.1112 1.1055 1.1026 1.0914 1.1028 1.0962 1.0953 1.0868 1.0922
[157] 1.0958 1.0994 1.1042 1.1198 1.1144 1.1110 1.1078 1.1028 1.1061 1.1200 1.1356 1.1580 1.1410
[170] 1.1390 1.1239 1.1172 1.1194 1.1263 1.1242 1.1104 1.1117 NA 1.1182 1.1165 1.1262 1.1338
[183] 1.1307 1.1260 1.1304 1.1312 1.1358 1.1204 1.1133 1.1160 1.1252 1.1192 1.1236 1.1246 1.1162
[196] 1.1200 1.1276 1.1200 1.1266 1.1249 1.1282 1.1363 NA 1.1382 1.1437 1.1418 1.1360 1.1320
[209] 1.1359 1.1345 1.1140 1.1016
Warning message:
NAs introduced by coercion
> Rate <- cbind(paste(Rate))
> plot(Rate)
Warning message:
In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
> plot.ts(Rate, ylab="EUR/USD")
Despite the warning message, I get the output shown below, which is what I intended to plot.
Nevertheless, I do not really understand why it works this way: why I have to use the paste() command, and what exactly it does. I get the basic idea of what the classes do, but I am very new to this whole world of R.
One thing I have come to realize already is that R is a very powerful program, and yet confusing if you are a beginner. :D
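On the "why paste()" question: the str() output above shows Rate as a Factor, and as.numeric() applied to a factor returns the internal integer level codes (1 to 180 here), not the numbers printed as labels. paste() works because it first turns each value back into its text label, which as.numeric() can then parse; as.character() does the same thing more directly. The sketch below reproduces the effect on a tiny factor. The "ND" entry is a hypothetical stand-in for whatever non-numeric text in the file caused the column to be read as a factor and later produced the NAs.

```r
# Tiny factor mimicking the Rate column; "ND" is a hypothetical non-numeric entry
rate <- factor(c("ND", "1.2015", "1.1918", "1.0524"))

codes <- as.integer(rate)  # internal level codes: small integers, not exchange rates

# Convert the labels, not the codes; "ND" cannot be parsed and becomes NA
values <- suppressWarnings(as.numeric(as.character(rate)))

plot.ts(values, ylab = "EUR/USD")  # y-axis now spans the real rates
```

Reading the file with stringsAsFactors = FALSE, and passing the placeholder text to na.strings if it is known, would avoid the factor in the first place.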

Plot histogram by first sorting data and then dividing x values into bins in R

I have a dataset in a given format:
USER.ID avgfrequency
1 3 3.7821782
2 7 14.7500000
3 9 13.4761905
4 13 5.1967213
5 16 6.7812500
6 26 41.7500000
7 49 13.6666667
8 50 7.0000000
9 51 1.0000000
10 52 17.7500000
11 69 4.5000000
12 75 9.9500000
13 91 84.2000000
14 98 8.0185185
15 138 14.2000000
16 139 34.7500000
17 149 7.6666667
18 155 35.3333333
19 167 24.0000000
20 170 7.3529412
21 171 4.4210526
22 175 6.5781250
23 176 19.2857143
24 177 10.4864865
25 178 28.0000000
26 180 4.8461538
27 183 25.5000000
28 184 13.0000000
29 210 32.0000000
30 215 13.4615385
31 220 11.3611111
32 223 26.2500000
I want to first sort the dataset by avgfrequency and then I want to plot count of USER.ID's that fall under different bin categories.
I want to divide avgfrequency into different bin categories of width 10.
I am trying to sort data using:
user_avgfrequency <- user_avgfrequency[order(user_avgfrequency[,1]), ]
but getting an error.

# Sample data from the question
df <- data.frame(
USER.ID = c(3,7,9,13,16,26,49,50,51,52,69,75,91,98,138,139,149,155,167,170,171,175,176,177,178,180,183,184,210,215,220,223),
avgfrequency = c(3.7821782,14.7500000,13.4761905,5.1967213,6.7812500,41.7500000,13.6666667,7.0000000,1.0000000,17.7500000,4.5000000,9.9500000,84.2000000,8.0185185,14.2000000,34.7500000,7.6666667,35.3333333,24.0000000,7.3529412,4.4210526,6.5781250,19.2857143,10.4864865,28.0000000,4.8461538,25.5000000,13.0000000,32.0000000,13.4615385,11.3611111,26.2500000)
)
# Bin edges of width 10 covering the full range, and one colour per bin
breaks <- seq(0, ceiling(max(df$avgfrequency) / 10) * 10, 10)
cols <- colorRampPalette(c('blue', 'green', 'red'))(length(breaks) - 1)
hist(df$avgfrequency, breaks, col = cols, axes = FALSE, xlab = 'Average Frequency', ylab = 'Count')
axis(1, breaks)
axis(2, 0:max(tabulate(cut(df$avgfrequency, breaks))))
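The sorting step from the question and the per-bin counts can also be obtained without plotting. Note that the question's order(user_avgfrequency[,1]) sorts by the first column, USER.ID, so the sketch below orders by avgfrequency explicitly. It uses the first six rows of the posted data; hist() itself does not need sorted input.

```r
# First six rows of the question's data
df <- data.frame(
  USER.ID = c(3, 7, 9, 13, 16, 26),
  avgfrequency = c(3.7821782, 14.75, 13.4761905, 5.1967213, 6.78125, 41.75)
)

# Sort rows by avgfrequency (column 2), not by USER.ID (column 1)
df_sorted <- df[order(df$avgfrequency), ]

# Bins of width 10: (0,10], (10,20], ...
breaks <- seq(0, ceiling(max(df$avgfrequency) / 10) * 10, by = 10)

# Number of users falling into each bin
bin_counts <- table(cut(df_sorted$avgfrequency, breaks))
bin_counts
```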
