Calculating R combinations from CSV file [closed] - r

I have a CSV file which contains about 400 values ranging from 10 000 to 50 000.
I want to calculate which combinations of selected values (for example 100, 150, 200, 250) correspond to values in the CSV file.
Is it possible to do it in R?
So this is part of the data:
1359.214844
1604.558594
1701.759766
1761.083984
1792.990234
1926.248047
1958.144531
2086.373047
2114.501953
2142.542969
2204.325621
2216.468750
2229.136719
2286.894531
2302.847656
2379.826172
2395.039063
2467.578125
2610.802734
2797.929688
2812.916016
2838.947266
2979.498047
3122.171875
3163.671875
3457.794922
3809.228516
3826.058594
3952.609375
3983.210938
4102.996094
Second data set is (146.058, 203.193, 162.053, 291.095)
I need the possible combinations of the second data set that correspond to values in the first one. For example, 291*2+162*5+203*4 = 2204.

There will be alternative ways to do that, like a loop that checks a specific combination at iteration i and decides to keep it or ignore it, but I prefer not to use loops when possible.
library(dplyr)
dt = read.table(text = "1359.214844
1604.558594
1701.759766
1761.083984
1792.990234
1926.248047
1958.144531
2086.373047
2114.501953
2142.542969
2204.325621
2216.468750
2229.136719
2286.894531
2302.847656
2379.826172
2395.039063
2467.578125
2610.802734
2797.929688
2812.916016
2838.947266
2979.498047
3122.171875
3163.671875
3457.794922
3809.228516
3826.058594
3952.609375
3983.210938
4102.996094")
# change column name and round values
names(dt) = "value"
dt$value = round(dt$value)
# give the manual values (assuming they are 4 values)
manual_values = c(146.058, 203.193, 162.053, 291.095)
# round values
manual_values = round(manual_values)
# get the maximum coefficient to investigate
coeff = ceiling(max(dt$value) / min(manual_values))
expand.grid(v1 = manual_values[1], ## create all combinations of coefficients and keep your values
v2 = manual_values[2],
v3 = manual_values[3],
v4 = manual_values[4],
coeff1 = 0:coeff,
coeff2 = 0:coeff,
coeff3 = 0:coeff,
coeff4 = 0:coeff) %>%
mutate(value = v1*coeff1+v2*coeff2+v3*coeff3+v4*coeff4) %>% ## calculate the value from each combination
inner_join(dt, by="value") ## join info from your initial values
## sample of the first 10 rows of the result :
# v1 v2 v3 v4 coeff1 coeff2 coeff3 coeff4 value
# 1 146 203 162 291 3 10 0 0 2468
# 2 146 203 162 291 7 12 0 0 3458
# 3 146 203 162 291 9 13 0 0 3953
# 4 146 203 162 291 7 3 1 0 1793
# 5 146 203 162 291 22 3 1 0 3983
# 6 146 203 162 291 15 4 1 0 3164
# 7 146 203 162 291 4 5 1 0 1761
# 8 146 203 162 291 0 11 1 0 2395
# 9 146 203 162 291 4 11 1 0 2979
# 10 146 203 162 291 2 14 2 0 3458
So, the first line of the output tells you that the combination 3*146 + 10*203 equals 2468, which is a value that exists in your initial dataset (CSV).
If you spot any bugs, or you need any clarifications let me know and I can update my answer.
A small improvement could be to replace the last inner_join with filter(value %in% dt$value). I don't think there's any reason to join when you can get the same output by using a filtering command.
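The filtering variant suggested above can be sketched in base R as follows. This is only an illustration under the same assumptions as the answer (values rounded to whole numbers, four manual values); the names targets, units and hits are made up for the example, and only three of the CSV values are used to keep it short.

```r
# Three of the CSV values and the four manual values, rounded as in the answer
targets <- round(c(2204.325621, 2467.578125, 3457.794922))
units <- round(c(146.058, 203.193, 162.053, 291.095))

# Largest coefficient worth trying: the biggest target divided by the smallest unit
max_coeff <- ceiling(max(targets) / min(units))

# Every combination of the four coefficients
grid <- expand.grid(coeff1 = 0:max_coeff, coeff2 = 0:max_coeff,
                    coeff3 = 0:max_coeff, coeff4 = 0:max_coeff)

# Value of each combination, then keep only the ones that hit a target
grid$value <- as.vector(as.matrix(grid) %*% units)
hits <- grid[grid$value %in% targets, ]
head(hits)
```

Because the match is made on rounded values, a hit only says that the rounded combination equals the rounded target; the unrounded sum can differ by a few units.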
For your other objective (specified in the comments) try this:
library(dplyr)
dt = read.table(text = "1359.214844
1604.558594
1701.759766
1761.083984
1792.990234
1926.248047
1958.144531
2086.373047
2114.501953
2142.542969
2204.325621
2216.468750
2229.136719
2286.894531
2302.847656
2379.826172
2395.039063
2467.578125
2610.802734
2797.929688
2812.916016
2838.947266
2979.498047
3122.171875
3163.671875
3457.794922
3809.228516
3826.058594
3952.609375
3983.210938
4102.996094")
# change column name and round values
names(dt) = "value"
dt$value = round(dt$value)
# give the manual values (assuming they are 4 values)
manual_values = c(146.058, 203.193, 162.053, 291.095)
# get the maximum coefficient to investigate
coeff = ceiling(max(dt$value) / min(manual_values))
expand.grid(v1 = manual_values[1], ## create all combinations of coefficients and keep your values
v2 = manual_values[2],
v3 = manual_values[3],
v4 = manual_values[4],
coeff1 = 0:3,
coeff2 = 5:coeff,
coeff3 = 5:coeff,
coeff4 = 0:3) %>%
mutate(SUM = v1*coeff1+v2*coeff2+v3*coeff3+v4*coeff4) %>% ## calculate the value of each combination
as_tibble() ## only for printing top 10 rows (tbl_df() is deprecated in current dplyr)
# v1 v2 v3 v4 coeff1 coeff2 coeff3 coeff4 SUM
# (dbl) (dbl) (dbl) (dbl) (int) (int) (int) (int) (dbl)
# 1 146.058 203.193 162.053 291.095 0 5 5 0 1826.230
# 2 146.058 203.193 162.053 291.095 1 5 5 0 1972.288
# 3 146.058 203.193 162.053 291.095 2 5 5 0 2118.346
# 4 146.058 203.193 162.053 291.095 3 5 5 0 2264.404
# 5 146.058 203.193 162.053 291.095 0 6 5 0 2029.423
# 6 146.058 203.193 162.053 291.095 1 6 5 0 2175.481
# 7 146.058 203.193 162.053 291.095 2 6 5 0 2321.539
# 8 146.058 203.193 162.053 291.095 3 6 5 0 2467.597
# 9 146.058 203.193 162.053 291.095 0 7 5 0 2232.616
# 10 146.058 203.193 162.053 291.095 1 7 5 0 2378.674
# .. ... ... ... ... ... ... ... ... ...
You can save this result table as a data frame and continue your process as you like.

Related

Why does the frequency reduce if I use ifelse function in R?

Is there a way to create categories from the combination of 2 variables/columns?
When I do
table(df$strategy.x)
0 1 2 3
70 514 223 209
table(df$strategy.y)
0 1 2 3
729 24 7 4
I want to create a variable with both of these combined. I tried this
df <- df %>%
mutate(nstrategy1 = ifelse(strategy.x==1| strategy.y==1 , 1, 0))
table(df$nstrategy1)
0 1
399 519
I am supposed to get 514 + 24 = 538 but I got 519 instead
df <- df %>% mutate(nstrategy2 = ifelse(strategy.x==2| strategy.y==2 , 1, 0))
table(df$nstrategy2)
0 1
578 228
Similarly, I am supposed to get 223 + 7 = 230, but I got 228 instead
Is there a good way to merge both strategy.x and strategy.y and end up with a table like the following with 4 categories?
0 1 2 3
799 538 230 213
table(mtcars$am) # 13 1's
table(mtcars$vs) # 14 1's
mtcars$ones = ifelse(mtcars$am == 1 | mtcars$vs == 1, 1, 0)
table(mtcars$ones) # 20 1's < 13 + 14 = 27
Why is it showing only 20 1's instead of 27? Because there are 7 + 6 + 7 = 20 cars with a 1 in at least one of am and vs. There are 13 with am==1 (6+7) and 14 with vs==1 (7+7). Seven cars sit in the bottom-right cell of the table below because they have 1's in both dimensions; adding 13 + 14 counts those seven twice.
table(mtcars$am, mtcars$vs)
# 0 1
# 0 12 7
# 1 6 7
The simplest way to get the sum of the two results would be by adding the two table objects:
table(mtcars$am) + table(mtcars$vs)
# 0 1
# 37 27
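The arithmetic behind the 20 versus 27 can be checked directly. A small sketch on the same mtcars columns; the variable names are only for illustration:

```r
a <- mtcars$am == 1   # 13 cars
b <- mtcars$vs == 1   # 14 cars

n_either <- sum(a | b)   # what ifelse(am == 1 | vs == 1, 1, 0) counts
n_both <- sum(a & b)     # cars with a 1 in both columns

# rows in either = rows in a + rows in b - rows in both
n_either == sum(a) + sum(b) - n_both
```

So the OR condition counts rows, while adding the two tables counts occurrences; the two differ by exactly the number of rows that have a 1 in both columns.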

How to use arguments specified in a user-created R function?

This seems like a basic question, but I may simply not know how to word it well enough to search for the answer that I need.
This is the sample:
id2 sbp1 dbp1 age1 sbp2 dbp2 sex bmi1 bmi2 smoke drink exercise
1 1 134.5 89.5 40 146 84 2 21.74685 22.19658 1 0 1
2 4 128.5 89.5 48 125 70 1 24.61942 22.29476 1 0 0
3 5 105.5 64.5 42 121 80 2 22.15103 26.90204 1 0 0
4 8 116.5 79.5 39 107 72 2 21.08032 27.64403 0 0 1
5 9 106.5 73.5 26 132 81 2 21.26762 29.16131 0 0 0
6 10 120.5 81.5 34 130 85 1 24.91663 26.89427 1 1 0
I have this code here for a function I am making:
linreg.ols<- function(indat, dv, p1, p2, p3){
data<- read.csv(file= indat, header=T)
data[1:5,]
y<- data$dv
x <- as.matrix(data.frame(x0=rep(1,nrow(data)), x1=data$p1, x2=data$p2,
x3=data$p3))
inv<- solve(t(x)%*%x)
xy<- t(x)%*%y
betah<- inv%*%xy
print("Value of beta hat")
betah
}
And when I run my code with this line:
linreg.ols("bp.csv",sbp1,smoke,drink,exercise)
I get the following error:
Error in data.frame(x0 = rep(1, nrow(data)), x1 = data$p1, x2 = data$p2, :
arguments imply differing number of rows: 75, 0
I have a feeling that it's because of how I am extracting the p1, p2, and p3 columns on the line where I create the x variable.
EDIT: changed to y<-data$dv
EDIT: added on part of the sample. Also, I tried:
x <- as.matrix(data.frame(1,data[,c("p1","p2","p3")]))
But that returned the error:
Error in `[.data.frame`(data, , c("p1", "p2", "p3")) : undefined columns selected
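A likely cause, offered as a sketch rather than a confirmed diagnosis: data$p1 looks for a column literally named p1, finds none, and returns NULL, which gives the "75, 0" row mismatch. Passing the column names as strings and extracting with [[ ]] avoids that. The function name linreg_ols_sketch is hypothetical, and the sketch takes a data frame built from the sample rows above instead of reading bp.csv, so that it runs on its own.

```r
# Sample rows from the question, typed in so the sketch is self-contained
bp <- data.frame(
  sbp1 = c(134.5, 128.5, 105.5, 116.5, 106.5, 120.5),
  smoke = c(1, 1, 1, 0, 0, 1),
  drink = c(0, 0, 0, 0, 0, 1),
  exercise = c(1, 0, 0, 1, 0, 0)
)

# Hypothetical variant: column names arrive as character strings
linreg_ols_sketch <- function(data, dv, p1, p2, p3) {
  y <- data[[dv]]                       # [[ ]] looks up the column named by the string
  x <- cbind(x0 = 1, x1 = data[[p1]], x2 = data[[p2]], x3 = data[[p3]])
  solve(t(x) %*% x) %*% (t(x) %*% y)    # OLS estimate: (X'X)^-1 X'y
}

betah <- linreg_ols_sketch(bp, "sbp1", "smoke", "drink", "exercise")
betah
```

Note the quotes in the call: an unquoted sbp1 would be looked up as a variable in the calling environment, not as a column of the data.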

Sorting elements by column in R

I have a simple line of code for a matrix
ind1=which(macierz==1,arr.ind = TRUE)
A fragment of the result is
> ind1
row col
TCGA.CH.5737.01 53 1
TCGA.CH.5791.01 66 1
P03.1334.Tumor 322 1
P04.1790.Tumor 327 1
CPCG0340.F1 425 1
TCGA.CH.5737.01 53 2
TCGA.CH.5791.01 66 2
P03.1334.Tumor 322 2
P04.1790.Tumor 327 2
CPCG0340.F1 425 2
I would like to sort it alphabetically by the first column. How can I do this in R?
It looks as if ind1 is a matrix and the first column is the rownames, so you probably need something like ind1 <- ind1[order(rownames(ind1)),]
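A minimal sketch of the row-name approach, using a small made-up matrix in place of macierz (the row names here are invented for the example):

```r
# Toy matrix standing in for macierz; row names are sample labels
m <- matrix(c(0, 1, 1,
              1, 0, 1),
            nrow = 3,
            dimnames = list(c("TCGA.b", "P03.a", "CPCG.c"), NULL))

# which(..., arr.ind = TRUE) returns a matrix whose row names are the labels
ind1 <- which(m == 1, arr.ind = TRUE)

# Sort by row name; drop = FALSE keeps a matrix even if one row remains
sorted <- ind1[order(rownames(ind1)), , drop = FALSE]
sorted
```

The labels are row names of the index matrix rather than a column, which is why ordering goes through rownames() instead of a column reference.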
You need (assuming your first column is called "label" and those are not rownames)
ind1[order(ind1$label),]
order() returns a vector of row indices that puts the data frame in alphabetical order of label. To make the example reproducible, I created your data frame like this:
ind1 <- data.frame ( label = c("TCGA.CH.5737.01", "TCGA.CH.5791.01",
"P03.1334.Tumor","P04.1790.Tumor", "CPCG0340.F1" , "TCGA.CH.5737.01",
"TCGA.CH.5791.01","P03.1334.Tumor", "P04.1790.Tumor", "CPCG0340.F1"),
row = c(53,66,322,327,425,53,66,322,327,425), col =
c(1,1,1,1,1,2,2,2,2,2),
stringsAsFactors = FALSE)
and the output is
> ind1[order(ind1$label),]
label row col
5 CPCG0340.F1 425 1
10 CPCG0340.F1 425 2
3 P03.1334.Tumor 322 1
8 P03.1334.Tumor 322 2
4 P04.1790.Tumor 327 1
9 P04.1790.Tumor 327 2
1 TCGA.CH.5737.01 53 1
6 TCGA.CH.5737.01 53 2
2 TCGA.CH.5791.01 66 1
7 TCGA.CH.5791.01 66 2
Hope that helps.
Regards, Umberto

how to select data based on a list from a split data frame and then recombine in R

I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness Weight Viscocity
132 1 3 14.93199362 94.37250417 579.4236727
676 1 4 44.58750591 70.03232054 1829.170727
699 2 5 89.02760079 54.30587287 1169.226863
850 3 6 30.74535903 83.84377678 707.2280513
951 4 237 67.79568019 51.10388484 917.6609965
1031 5 56 74.06697003 63.31274502 1981.17804
1175 4 354 98.9656142 97.7523884 100.7357981
1483 5 726 9.958040999 51.29537311 1222.910211
1529 7 800 64.11430235 65.69780939 573.8266137
1698 9 125 67.83105185 96.53847341 486.9620194
1748 9 1005 49.43602318 52.9139591 1881.740184
2005 9 28 26.89821508 82.12663209 1709.556135
2111 2 76 83.03593144 85.23622731 276.5088502
I want to split this data by Test_No and then compute the number of unique Category values per Test_No, and also the median Category value. I chose to use split and sapply in the following way, but I am getting an error about a missing parenthesis. Is there anything wrong with my approach? Please find my code below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Appending my question:
I would want to display the data containing the following information:
Test_No, Category, Median_Cat and Cat_Count
We can try with dplyr
library(dplyr)
Test %>%
group_by(Test_No) %>%
summarise(Cat_Count = n_distinct(Category),
Median_Cat = median(Category,na.rm = TRUE),
Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or, if you prefer base R, we can also try aggregate
aggregate(Category~Test_No, Test, function(x) c(Cat_Count = length(unique(x)),
Median_Cat = median(x,na.rm = TRUE), Category = toString(x)))
As for the function you wrote, I think there are some syntax issues in it: it is never assigned to a name, and the sapply call passes function(ModRange) with no body. A corrected version:
new_func <- function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)),
Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only those first rows where the values of 'X_POSITION' are increasing. In other words, I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
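Here is something you can try with the dplyr package:
library(dplyr)
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = -Inf)) %>% # TRUE where X rose; the first row of a trial is always TRUE
mutate(FIRST_PASS_TIME = sum(DURATION[cumsum(!IncLogic) == 0])) %>% # only rows before the first non-increase, so a later rise is not counted
select(-IncLogic)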
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- tibble(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
filter(cumsum(!x.increasing) == 0) %>% # keep only the first increasing run
summarize(FIRST_PASS_TIME = sum(DURATION)) # sum DURATION, not X_POSITION
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 562
#2 2 1122
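For comparison, the "first increasing run" rule can also be written in base R. This is a sketch, not any of the answerers' code; first_pass_time is a hypothetical helper and the data frame is the sample from the question.

```r
# Sum duration over the rows before x first fails to increase
first_pass_time <- function(duration, x) {
  inc <- c(TRUE, diff(x) > 0)        # the first row always counts
  sum(duration[cumsum(!inc) == 0])   # stop at the first non-increase
}

dat <- data.frame(
  TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  DURATION = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
  X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0)
)

# One value per trial, then copied back onto every row of that trial
fpt <- sapply(split(dat, dat$TRIAL_INDEX),
              function(g) first_pass_time(g$DURATION, g$X_POSITION))
dat$FIRST_PASS_TIME <- unname(fpt[as.character(dat$TRIAL_INDEX)])
dat
```

The cumsum(!inc) == 0 test is what restricts the sum to the first pass: once X has dropped, a later increase within the same trial is ignored.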
