Selecting/sorting values within a table in R - r

I'm working in R with the following dataset for a metabolomics study.
first Name Area Sample Similarity
120 Pentanone 699468 PO4:1 954
120 Pentanone 153744 PO2:1 981
126 Methylamine 83528 PO4:1 887
126 Unknown 32741 PO2:1 645
126 Sulfurous 43634 PO1:1 800
I want to select, among the rows that share the same value in the first column (for example 120), the compounds with the same name (for example Pentanone). From that selection, I want to keep the row with the highest similarity and copy its information into new columns within the table. In this case, that would be the following row:
120 Pentanone 153744 PO2:1 981
I know that "send me the code" posts are not much appreciated, but I would greatly appreciate some clues on how to start.

You can use the plyr package.
First I reproduce your data (try to use dput(dat) next time):
dat <- read.table(text ='first Name Area Sample Similarity
120 Pentanone 699468 PO4:1 954
120 Pentanone 153744 PO2:1 981
126 Methylamine 83528 PO4:1 887
126 Unknown 32741 PO2:1 645
126 Sulfurous 43634 PO1:1 800',header=TRUE)
I split the data.frame by (first & Name).
I apply the function to each set of rows.
I combine the results into a new data.frame.
library(plyr)
ddply(dat,.(first,Name),function(x) x[x$Similarity==max(x$Similarity),])
first Name Area Sample Similarity
1 120 Pentanone 153744 PO2:1 981
2 126 Methylamine 83528 PO4:1 887
3 126 Sulfurous 43634 PO1:1 800
4 126 Unknown 32741 PO2:1 645
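The split, apply and combine steps listed above can also be spelled out one at a time in base R. This is only a sketch of the same idea, not a description of what ddply does internally; the names groups, best and result are made up for illustration, and which.max keeps a single row per group if there are ties.

```r
dat <- read.table(text = 'first Name Area Sample Similarity
120 Pentanone 699468 PO4:1 954
120 Pentanone 153744 PO2:1 981
126 Methylamine 83528 PO4:1 887
126 Unknown 32741 PO2:1 645
126 Sulfurous 43634 PO1:1 800', header = TRUE)

# 1. split: one data.frame per (first, Name) combination that actually occurs
groups <- split(dat, list(dat$first, dat$Name), drop = TRUE)

# 2. apply: keep the row with the highest Similarity in each group
best <- lapply(groups, function(x) x[which.max(x$Similarity), ])

# 3. combine: stack the selected rows into one data.frame
result <- do.call(rbind, best)
result
```

The row names of result come from the group labels produced by split; rownames(result) <- NULL resets them if they get in the way.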

There are many options. You already have one example using plyr; here are two more.
Base R approach, using aggregate and merge:
merge(dat, aggregate(Similarity ~ first + Name, dat, max))
# first Name Similarity Area Sample
# 1 120 Pentanone 981 153744 PO2:1
# 2 126 Methylamine 887 83528 PO4:1
# 3 126 Sulfurous 800 43634 PO1:1
# 4 126 Unknown 645 32741 PO2:1
A sqldf approach:
library(sqldf)
sqldf("select *, max(Similarity) `Similarity` from dat group by first, Name")
# first Name Similarity Area Sample
# 1 120 Pentanone 981 153744 PO2:1
# 2 126 Methylamine 887 83528 PO4:1
# 3 126 Sulfurous 800 43634 PO1:1
# 4 126 Unknown 645 32741 PO2:1


Remove row with specific value

I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data[row, column]. But when I want only the rows in which Quarter equals (for example) 4, I use data[Quarter == 4, ]. Why is it not data[, Quarter == 4], since Quarter is a column and not a row?
I did not find an appropriate answer on the internet that really explains why.
Thank you.
You created your data with the data.table() function, so you can write:
sales[Customer != 891,]
data[Quarter == 4, ] returns all columns for the rows where Quarter is equal to 4 (note the double ==, which tests for equality). The condition goes before the comma because it selects rows; the empty slot after the comma means that no column selection is applied, so every column is kept.
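To see why the condition belongs in the row position, it may help to look at what the condition evaluates to. The sketch below uses a plain data frame with a hypothetical five-row subset of the sales data (the name sales_df and the chosen rows are assumptions for illustration), so it runs without data.table.

```r
# Hypothetical five-row subset of the sales data, as a plain data frame
sales_df <- data.frame(Customer = c(192, 964, 929, 345, 891),
                       Quarter  = c(2, 3, 3, 4, 3))

# The condition yields one TRUE/FALSE per row, so it belongs in the row slot
keep <- sales_df$Quarter == 4
keep
# [1] FALSE FALSE FALSE  TRUE FALSE

# Rows where Quarter is 4, all columns
q4 <- sales_df[keep, ]

# Every row except the one for Customer 891
no891 <- sales_df[sales_df$Customer != 891, ]
```

Quarter is a column, but Quarter == 4 is a statement about each row, which is why it is used to pick rows.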
When you use indexing, i.e. data[row, column], you are telling R to look for specific rows, specific columns, or both.
Rows: sales[sales$Customer %in% c(192,964),] translates to "search the column Customer in the data frame (or table) for rows whose value is 192 or 964 and isolate them". Note that data.table also allows sales[Customer %in% c(192, 964),], but data frames do not (use sales[sales$Customer %in% c(192,964),]).
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns: sales[, "Customer"] translates to "search the data frame (or table) for the column named Customer and isolate all its rows":
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine the two. For example, sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] isolates the values of Customer and Producttype for the rows in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...

R - Percentage of whole dataframe per column

I have a data frame reporting the count of answers per question (this is just a part of it), and I'd like to obtain the answer percentage for each question. I've found adorn_percentages, but it computes the percentages over the whole data frame, whereas I just want the percentage within each column. Each column has a total of 2230 answers.
I was thinking of using something like (x/2230)*100 but I don't know how to go on.
df<-data.frame(q1=c(159,139,1048,571,93), q2=c(106,284,1043,672,125), q3=c(99,222,981,843,94))
q1 q2 q3
1 159 106 99
2 139 284 222
3 1048 1043 981
4 571 672 843
5 93 125 94
We may use colSums for the division, after expanding the column totals to the same length as the data:
100 * df/colSums(df)[col(df)]
or use sweep
100 * sweep(df, 2, colSums(df), `/`)
Or use proportions
df[paste0(names(df), "_prop")] <- 100 * proportions(as.matrix(df), 2)
Output:
> df
q1 q2 q3 q1_prop q2_prop q3_prop
1 159 106 99 7.910448 4.753363 4.421617
2 139 284 222 6.915423 12.735426 9.915141
3 1048 1043 981 52.139303 46.771300 43.814203
4 571 672 843 28.407960 30.134529 37.650737
5 93 125 94 4.626866 5.605381 4.198303
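The same per-column division can also be written as an explicit function applied to each column, which mirrors the (x/total)*100 idea from the question. This is a minimal base R sketch; it divides by each column's own total via sum(x) rather than a fixed 2230, and the name pct is made up for illustration.

```r
df <- data.frame(q1 = c(159, 139, 1048, 571, 93),
                 q2 = c(106, 284, 1043, 672, 125),
                 q3 = c(99, 222, 981, 843, 94))

# Percentage of each value within its own column
pct <- as.data.frame(lapply(df, function(x) 100 * x / sum(x)))

# Sanity check: every column of percentages should add up to 100
colSums(pct)
```

If the denominator really must be a fixed number for every column, sum(x) can be swapped for that constant.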
You can apply prop.table to each column:
library(dplyr)
df %>% mutate(across(everything(), prop.table, .names = '{col}_prop') * 100)
# q1 q2 q3 q1_prop q2_prop q3_prop
#1 159 106 99 7.910448 4.753363 4.421617
#2 139 284 222 6.915423 12.735426 9.915141
#3 1048 1043 981 52.139303 46.771300 43.814203
#4 571 672 843 28.407960 30.134529 37.650737
#5 93 125 94 4.626866 5.605381 4.198303

Prevent duplicates in R

I have a column in a data table which has entries in non-decreasing order. But there can be duplicate entries.
labels <- c(123,123,124,125,126,126,128)
time <- data.table(labels,unique_labels="")
time
labels unique_labels
1: 123
2: 123
3: 124
4: 125
5: 126
6: 126
7: 128
I want to make all entries unique, so the output will be
time
labels unique_labels
1: 123 123
2: 123 124
3: 124 125
4: 125 126
5: 126 127
6: 126 128
7: 128 130
Following is a loop implementation for this:
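prev_label <- 0
unique_counter <- 0
for (i in seq_along(time$labels)) {
  if (time$labels[i] != prev_label) {
    # a new label starts: remember it
    prev_label <- time$labels[i]
  } else {
    # a repeated label: shift everything from here on by one more
    unique_counter <- unique_counter + 1
  }
  time$unique_labels[i] <- time$labels[i] + unique_counter
}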
There's a vectorized solution that avoids for loops completely.
Since time is an R function, I've changed the name of your data.table to tm.
cumsum(duplicated(tm$labels)) + tm$labels
[1] 123 124 125 126 127 128 130
tm$unique_labels <- cumsum(duplicated(tm$labels)) + tm$labels
tm
labels unique_labels
1: 123 123
2: 123 124
3: 124 125
4: 125 126
5: 126 127
6: 126 128
7: 128 130
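To see why cumsum(duplicated(...)) produces the right shift, the one-liner can be broken into its intermediate steps. This sketch works on a plain vector and assumes, as in the question, that the labels are whole numbers in non-decreasing order; the names is_dup, shift and unique_labels are only for illustration.

```r
labels <- c(123, 123, 124, 125, 126, 126, 128)

# TRUE wherever a value has already appeared earlier in the vector
is_dup <- duplicated(labels)
# FALSE TRUE FALSE FALSE FALSE TRUE FALSE

# Running count of duplicates seen so far: the shift to apply at each position
shift <- cumsum(is_dup)
# 0 1 1 1 1 2 2

unique_labels <- labels + shift
unique_labels
# 123 124 125 126 127 128 130
```

Each duplicate pushes itself and everything after it up by one, which is exactly what unique_counter does in the loop version.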
Another option is to replace the duplicated entries with placeholder labels:
tank <- paste("t", 1:NROW(labels), sep = "")
time$unique_labels <- ifelse(duplicated(time), tank, time$labels)
The duplicated method from the data.table package returns a logical vector flagging the duplicated rows of your dataset; just replace those with "random" values that you are sure are not used in your set.

adding and subtracting values in multiple data frames of different lengths - flow analysis

Thank you jakub and Hack-R!
Yes, these are my actual data. The data I am starting from are the following:
[A] #first, longer dataset
CODE_t2 VALUE_t2
111 3641
112 1691
121 1271
122 185
123 522
124 0
131 0
132 0
133 0
141 626
142 170
211 0
212 0
213 0
221 0
222 0
223 0
231 95
241 0
242 0
243 0
244 0
311 129
312 1214
313 0
321 0
322 0
323 565
324 0
331 0
332 0
333 0
334 0
335 0
411 0
412 0
421 0
422 0
423 0
511 6
512 0
521 0
522 0
523 87
In the above table, we can see the 44 land use CODES (which I inappropriately named "class" in my first entry) for a certain city. Some values are just 0, meaning that there are no land uses of that type in that city.
Starting from this table, which displays all the land use types for t2 and their corresponding values ("VALUE_t2") I have to reconstruct the previous amount of land uses ("VALUE_t1") per each type.
To do so, I have to add and subtract the value per each land use (if not 0) by using the "change land use table" from t2 to t1, which is the following:
[B] #second, shorter dataset
CODE_t2 CODE_t1 VALUE_CHANGE1
121 112 2
121 133 12
121 323 0
121 511 3
121 523 2
123 523 4
133 123 3
133 523 4
141 231 12
141 511 37
So, in order to get VALUE_t1 from VALUE_t2, I have, for instance, to subtract 2 + 12 + 0 + 3 + 2 hectares (first 5 values of the second, shorter table) from the value of land use type/code 121 of the first, longer table (1271 ha), and add 2 hectares to land type 112, 12 hectares to land type 133, 3 hectares to land type 511 and 2 hectares to land type 523. And I have to do that for all the land use types different than 0, and later also from t1 to t0.
What I have to do is a sort of loop that would both add and subtract, per each land use type/code, the values from VALUE_t2 to VALUE_t1, and from VALUE_t1 to VALUE_t0.
Once I have estimated VALUE_t1 and VALUE_t0, I will put the values in a simple table showing the relative variation (here the values are not real):
CODE VALUE_t0 VALUE_t2 % VAR t2-t0
code1 50 100 ((100-50)/50)*100
code2 70 80 ((80-70)/70)*100
code3 45 34 ((34-45)/45)*100
What I could do so far is:
land_code <- names(A)[-1]
land_code
A$VALUE_t1 <- for(code in land_code{
cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
}
If I use the loop I get an error, while if I take it away:
A$VALUE_t1 <- cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
it works but I don't really get what I want to get... so far I was working on how to get a new column which would contain the new "add & subtract" values, but haven't succeeded yet. So I worked on how to get a new column which would at least match the land use types first, to then include the "add and subtract" formula.
Another problem is that, by using "match", I get a shorter A$VALUE_t1 table (13 rows instead of 44), while I would like to keep all the land use types in dataset A, because I will have then to match it with the table including VALUES_t0 (which I haven't shown here).
Sorry that I cannot do better than this at the moment... and I hope to have explained better what I have to do. I am extremely grateful for any help you can provide to me.
thanks a lot
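The add-and-subtract step described above could be sketched without a loop as follows. This is one possible reading of the description, not a confirmed solution: it assumes that VALUE_CHANGE1 is subtracted from the CODE_t2 it is listed under and added to the corresponding CODE_t1. Only the codes that occur in table B, plus 231 and 323, are included in A here to keep the example short, and the helper flow_for is a made-up name.

```r
# Subset of table A (only some of the 44 codes) and the full table B
A <- data.frame(CODE_t2  = c(112, 121, 123, 133, 141, 231, 323, 511, 523),
                VALUE_t2 = c(1691, 1271, 522, 0, 626, 95, 565, 6, 87))
B <- data.frame(CODE_t2 = c(121, 121, 121, 121, 121, 123, 133, 133, 141, 141),
                CODE_t1 = c(112, 133, 323, 511, 523, 523, 123, 523, 231, 511),
                VALUE_CHANGE1 = c(2, 12, 0, 3, 2, 4, 3, 4, 12, 37))

# Total change to subtract from each t2 code and to add to each t1 code
out_flow <- aggregate(VALUE_CHANGE1 ~ CODE_t2, B, sum)
in_flow  <- aggregate(VALUE_CHANGE1 ~ CODE_t1, B, sum)

# Look up the total for every code in A; codes absent from B count as 0,
# so all rows of A are kept
flow_for <- function(codes, flow_codes, flow_values) {
  v <- flow_values[match(codes, flow_codes)]
  v[is.na(v)] <- 0
  v
}

A$VALUE_t1 <- A$VALUE_t2 -
  flow_for(A$CODE_t2, out_flow$CODE_t2, out_flow$VALUE_CHANGE1) +
  flow_for(A$CODE_t2, in_flow$CODE_t1, in_flow$VALUE_CHANGE1)
A
```

With these numbers, code 121 goes from 1271 to 1252 (1271 minus 2 + 12 + 0 + 3 + 2) and code 112 from 1691 to 1693. Because match is applied from A into the aggregated tables rather than the other way round, A keeps all of its rows. The same pattern could be repeated with a t1-to-t0 change table to obtain VALUE_t0.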

Row wise operation on data.table

Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
y=sample(1:1000,1000),
z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[1,])) # for i in 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the min and max across columns in a vectorized way, which is much better than splitting and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
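To restrict the calculation to particular rows, the same pmax/pmin idea can be applied after subsetting. The sketch below uses a plain data frame so that it runs without data.table; the row selection rows is a hypothetical example, and the apply version from the question is included only to check that both approaches agree.

```r
set.seed(1)
dat <- data.frame(x = sample(1:1000, 1000),
                  y = sample(1:1000, 1000),
                  z = sample(1:1000, 1000))

rows <- c(1:4, 15:18)   # hypothetical selection of row numbers

# Subset first, then take the element-wise max and min across the columns
sub <- dat[rows, ]
r_fast <- do.call(pmax, sub) - do.call(pmin, sub)

# Row-by-row version from the question, for comparison
r_slow <- apply(sub, 1, function(x) diff(range(x)))

all(r_fast == unname(r_slow))
```

do.call passes each column of sub as a separate argument to pmax and pmin, so the comparison runs across columns for all selected rows at once.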
How about this:
D[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
#actually this will be faster
D[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
Use .I to create an index that you can pass to the by= parameter; the function then runs on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before or during the function call. For example, if you only want every second row:
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since their names weren't specified they are just numbers counting up)
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?
