Error in Sub_data[1, i] : subscript out of bounds - r

I'm trying to write a program in R using for loops and if statements.
I have a dataset that contains 17 rows and 1091 columns (variables).
I want to compare the values in the 17th row and put the columns that share the same value into one data frame so I can process them afterwards.
The algorithm I thought of contains the following steps:
1- Take the column I want to compare and put it in a new data frame (Sub_data)
2- Compare the value in the 17th row of this column with the 17th-row values of all the other columns in the original data (All_data)
3- When the value of the column equals the value of any other column (B), take that column B and add it to Sub_data
4- After that, I want to compare the variation of the variables in Sub_data (which contains the columns with the same 17th-row value), choose one of the columns that have the same variation, and eliminate the others
Here are the rows and the first two columns of my data (All_data):
MT95T843 MT95T756
QC_G.F9_01_4768 70027.0213162601 95774.1359666849
QC_G.F9_01_4765 69578.1863357392 81479.2957458262
QC_G.F9_01_4762 69578.1863357392 87021.9542724389
QC_G.F9_01_4759 68231.1433794304 95558.7673782843
QC_G.F9_01_4756 64874.1293568862 96780.772452217
QC_G.F9_01_4753 63866.6577969569 91854.3530432699
CtrF01R5_G.D1_01_4757 66954.3879935821 128861.361627886
CtrF01R4_G.D5_01_4763 97352.5522885788 101353.25926633
CtrF01R3_G.C8_01_4754 61311.7857641721 7603.60895516428
CtrF01R2_G.D3_01_4760 85768.3611731878 109461.75444564
CtrF01R1_G.C9_01_4755 85302.8194715206 104253.845374077
BtiF01R5_G.D7_01_4766 61252.4254487766 115683.737549183
BtiF01R4_G.D6_01_4764 81873.9637852956 112164.142293011
BtiF01R3_G.D2_01_4758 84981.2191408476 0
BtiF01R2_G.D4_01_4761 36629.0246187626 124806.491006666
BtiF01R1_G.D8_01_4767 0 109927.264246577
rt 13.9018138671285 13.9058590777331
Code for input dataframe :
df1 <- data.frame(Name = c("QC_G.F9_01_4768", "QC_G.F9_01_4765", "QC_G.F9_01_4762", "QC_G.F9_01_4759", "QC_G.F9_01_4756", "QC_G.F9_01_4753",
"CtrF01R5_G.D1_01_4757", "CtrF01R4_G.D5_01_4763", "CtrF01R3_G.C8_01_4754", "CtrF01R2_G.D3_01_4760", "CtrF01R1_G.C9_01_4755",
"BtiF01R5_G.D7_01_4766", "BtiF01R4_G.D6_01_4764", "BtiF01R3_G.D2_01_4758", "BtiF01R2_G.D4_01_4761", "BtiF01R1_G.D8_01_4767",
"rt"),
MT95T843 = c(70027.0213162601, 69578.1863357392, 69578.1863357392, 68231.1433794304, 64874.1293568862, 63866.6577969569, 66954.3879935821,
97352.5522885788, 61311.7857641721, 85768.3611731878, 85302.8194715206, 61252.4254487766, 81873.9637852956, 84981.2191408476,
36629.0246187626, 0, 13.9018138671285),
MT95T756 = c(95774.1359666849, 81479.2957458262, 87021.9542724389, 95558.7673782843, 96780.772452217, 91854.3530432699, 128861.361627886,
101353.25926633, 7603.60895516428, 109461.75444564, 104253.845374077, 115683.737549183, 112164.142293011, 0, 124806.491006666,
109927.264246577, 13.9058590777331))
df1
#> Name MT95T843 MT95T756
#> 1 QC_G.F9_01_4768 70027.02132 95774.13597
#> 2 QC_G.F9_01_4765 69578.18634 81479.29575
#> 3 QC_G.F9_01_4762 69578.18634 87021.95427
#> 4 QC_G.F9_01_4759 68231.14338 95558.76738
#> 5 QC_G.F9_01_4756 64874.12936 96780.77245
#> 6 QC_G.F9_01_4753 63866.65780 91854.35304
#> 7 CtrF01R5_G.D1_01_4757 66954.38799 128861.36163
#> 8 CtrF01R4_G.D5_01_4763 97352.55229 101353.25927
#> 9 CtrF01R3_G.C8_01_4754 61311.78576 7603.60896
#> 10 CtrF01R2_G.D3_01_4760 85768.36117 109461.75445
#> 11 CtrF01R1_G.C9_01_4755 85302.81947 104253.84537
#> 12 BtiF01R5_G.D7_01_4766 61252.42545 115683.73755
#> 13 BtiF01R4_G.D6_01_4764 81873.96379 112164.14229
#> 14 BtiF01R3_G.D2_01_4758 84981.21914 0.00000
#> 15 BtiF01R2_G.D4_01_4761 36629.02462 124806.49101
#> 16 BtiF01R1_G.D8_01_4767 0.00000 109927.26425
#> 17 rt 13.90181 13.90586
I'm stuck on the third step, where I get this error message:
Error in Sub_data[1, i] : subscript out of bounds
Here's the code I used:
library("readxl")
library("janitor")
All_data <- read_excel("DataMatrix_Excel.xlsx")
dim(All_data)
#> [1]   17 1091
for(i in 1:1091){
  #Add column
  Sub_data <- cbind(All_data[ , 1, drop=F])
  for(j in 2:1091){
    if(Sub_data[17,1]==All_data[17,j]) {
      Sub_data <- cbind(Sub_data,All_data[ , j, drop=F])
      #I added this line just to see if my code works
      print(paste("The dim is " , dim(Sub_data)))
    }
  }
}
Please tell me if you need any more information or clarification, and please let me know if you have any suggestions.
Thank you very much
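A minimal sketch of steps 1 to 3 on toy data, not the asker's file. The names toy, last_row, groups and sub_list are made up for illustration, and all columns are assumed to be numeric. Instead of growing Sub_data inside nested loops, it groups the column names by their value in the last row. The last lines show that R reports "subscript out of bounds" when an index points past the end of an object, here a matrix with two columns.

```r
# Toy stand-in for All_data (hypothetical values): the last row plays the role of "rt"
toy <- data.frame(A = c(1, 2, 13.90),
                  B = c(3, 4, 13.90),
                  C = c(5, 6, 14.20),
                  D = c(7, 8, 14.20))

# Value of the last row for every column, as a named numeric vector
last_row <- unlist(toy[nrow(toy), ])

# Group the column names by that value (exact equality is assumed to be what is wanted)
groups <- split(names(toy), last_row)

# One sub data frame per distinct last-row value
sub_list <- lapply(groups, function(cols) toy[, cols, drop = FALSE])
Sub_data <- sub_list[[1]]   # columns A and B

# Indexing past the last column of a matrix reproduces the error message
m <- as.matrix(Sub_data)
msg <- tryCatch(m[1, 5], error = function(e) conditionMessage(e))
msg
```

Because the grouping is done once on the whole last row, there is no need to track a running column index, which is where out-of-range subscripts tend to creep in.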

Related

Error in rowSums(select(Flower, !c(Ranunculus.repens, Ranunculus.acris, : 'x' must be numeric

df<- mutate(total = rowSums(select(Flower, !c(Ranunculus.repens, Ranunculus.acris, Ranunculus.ficaria, Trifolium.repens, Geranium.molle, Cardamine.flexuosa, Bellis.perennis, Taraxacum.officinalis, Lamium.purpureum, Glechoma.hederacea, Cardamine.pratensis, Medicago.lupulina, Medicago.arabica, Cerastium.fontanum, Prunella.vulgaris, Sonchus.arvensis, Veronica.persica, Veronica.chamaedrys, Viola.riviniana)), na.rm = TRUE))
I am trying to use this code to sum the rows of these specific columns but keep getting an X must be numeric error. I have checked the columns and they are all integers.
Does anyone know why this might be? I've included a sample of my dataset below.
Ranunculus.repens
Ranunculus.acris
Ranunculus.ficaria
Trifolium.repens
7
19
4
8
5
12
10
1
7
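One thing that may be worth checking, offered as a guess rather than a diagnosis: in select(), !c(...) keeps every column except the listed ones, so any non-numeric column left in the data would reach rowSums(). The base R sketch below uses a made-up Flower data frame with a hypothetical character column Site to show both the error and a sum restricted to the species columns.

```r
# Hypothetical stand-in for Flower; Site is an assumed non-numeric column
Flower <- data.frame(
  Site = c("a", "b", "c"),
  Ranunculus.repens = c(7L, 5L, 7L),
  Ranunculus.acris  = c(19L, 12L, 0L),
  Trifolium.repens  = c(8L, 1L, 2L)
)
species <- c("Ranunculus.repens", "Ranunculus.acris", "Trifolium.repens")

# Summing everything EXCEPT the species columns hits the character column
err <- tryCatch(
  rowSums(Flower[, !(names(Flower) %in% species), drop = FALSE]),
  error = function(e) conditionMessage(e)
)
err

# Summing ONLY the species columns works
Flower$total <- rowSums(Flower[, species], na.rm = TRUE)
Flower$total
```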

Group by name and add up the columns count in r

I have a dataset with 405 observations and 39 variables. But just two columns are important for further analysis.
I would like to group the entries in the first column that have similar names together and add up their numbers from the second column.
Reproducible dataset looks like this:
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facebook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
Outcome should be in an new data.frame and look like this:
df2 <- data.frame (name=c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft","Others"),
value=c(30,70,50,60,70,80,190))
A tidyverse way of doing it:
First, store all valid names in a vector, say valid_names.
Then create a new column, say all_names, in df1 by:
first splitting all strings at the space ' ' using str_split, and
then using purrr::map_chr() to check whether any of the split strings matches one of your valid_names; if so, keep only that string, otherwise return 'others'.
Then group_by on this field. (I skipped the separate mutate step and created the new field directly in the group_by statement, which works.)
Finally, summarise your important values as desired.
valid_names =c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft")
valid_names
#> [1] "Google" "Facebook" "Twitter" "Flurry" "Amazon" "Microsoft"
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facebook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
df1
#> name value unimportant
#> 1 Google Ads 10 1
#> 2 Google Doubleclick 20 2
#> 3 Facebook Login 30 3
#> 4 Facebook Ads 40 4
#> 5 Twitter MoPub 50 5
#> 6 Flurry 60 6
#> 7 Amazon advertisment 70 7
#> 8 Microsoft 80 8
#> 9 Ad4screen 90 9
#> 10 imobi 100 10
library(tidyverse)
df1 %>% group_by(all_names = str_split(name, ' '),
all_names = map_chr(all_names, ~ ifelse(any(.x %in% valid_names),.x[.x %in% valid_names], 'others'))) %>%
summarise(value = sum(value), .groups = 'drop')
#> # A tibble: 7 x 2
#> all_names value
#> <chr> <dbl>
#> 1 Amazon 70
#> 2 Facebook 70
#> 3 Flurry 60
#> 4 Google 30
#> 5 Microsoft 80
#> 6 others 190
#> 7 Twitter 50
Created on 2021-06-22 by the reprex package (v2.0.0)
This works on the sample data using the adist function with partial=TRUE to look at partial string matches. It requires defining the known groups up front, though, rather than trying to discover them. I think this legwork is worth doing, as it simplifies the problem a lot once the desired output is known.
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facbook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
# types we want to map. known is the groupings
types <- unique(df1$name)
known <- c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft")
# use distance measures, and look for matches on partial strings, e.g.
# ignore the Doubleclick part when matching on Google
distance <- adist(known, types, partial=TRUE)
# cap controls leniency in matching, e.g. Facbook and Facebook have a distance of 1
# whilst Facebook and Facebook is a perfect match with a score of 0
# Raise it to be more lenient
cap <- 1
# loop through the types
map_all <- sapply(seq_along(types), function(i){
# find minimum value, check if its below the cap. If so, assign to the closest
# group, else assign to others
v <- min(distance[,i])
if(v <= cap){
map_i <- known[which.min(distance[,i])]
}else{
map_i <- "Others"
}
map_i
})
# now merge in to df1, then sum out using your preferred method
df_map <- data.frame(name=types, group=map_all)
df_merged <- merge(df1, df_map, by="name")
df2 <- aggregate(value ~ group, sum, data=df_merged)
df2
group value
1 Amazon 70
2 Facebook 70
3 Flurry 60
4 Google 30
5 Microsoft 80
6 Others 190
7 Twitter 50
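For comparison, a base R sketch of the same grouping. It assumes the brand is always the first word of name, which holds for the sample data but may not hold for the full dataset. first_word and group are illustrative names.

```r
df1 <- data.frame(
  name = c("Google Ads", "Google Doubleclick", "Facebook Login",
           "Facebook Ads", "Twitter MoPub", "Flurry", "Amazon advertisment",
           "Microsoft ", "Ad4screen", "imobi"),
  value = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
)
valid_names <- c("Google", "Facebook", "Twitter", "Flurry", "Amazon", "Microsoft")

# Keep everything before the first space (also trims the trailing space in "Microsoft ")
first_word <- sub(" .*$", "", df1$name)

# Anything that is not a known brand goes to "Others"
group <- ifelse(first_word %in% valid_names, first_word, "Others")

# Sum the values per group
df2 <- aggregate(value ~ group, data = data.frame(group, value = df1$value), FUN = sum)
df2
```

Unlike the adist approach, this does not tolerate misspellings such as "Facbook"; those would land in "Others".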

Correct variable values in a dataframe applying a function using variable-specific values in another dataframe in R

I have a df called 'covs' with sites in rows and, in columns, 9 different environmental variables for each of these sites. I need to recalculate the value of each cell using the function (x - center_values(x)) / scale_values(x). However, 'center_values' and 'scale_values' are different for each environmental covariate, and they are located in another df called 'correction'.
I have found many solutions for applying a function for a whole df, but not for applying specific values according to the id of the value to transform.
covs <- read.table(text = "X elev builtup river grip pa npp treecov
384879-2009 1 24.379101 25188.572 1241.8348 1431.1082 5.705152e+03 16536.664 60.23175
385822-2009 2 29.533478 32821.770 2748.9053 1361.7772 2.358533e+03 15773.115 62.38455
385823-2009 3 30.097059 28358.244 2525.7627 1073.8772 4.340906e+03 14899.451 46.03269
386765-2009 4 33.877861 40557.891 927.4295 1049.4838 4.580944e+03 15362.518 53.08151
386766-2009 5 38.605156 36182.801 1479.6178 1056.2130 2.517869e+03 13389.958 35.71379",
header= TRUE)
correction <- read.table(text = "var_name center_values scale_values
1 X 196.5 113.304898393671
2 elev 200.217889868483 307.718211316278
3 builtup 31624.4888660664 23553.2438790344
4 river 1390.41023742909 1549.88661649406
5 grip 5972.67361738244 6996.57793554527
6 pa 2731.33431010861 4504.71055521749
7 npp 10205.2997576655 2913.19658598938
8 treecov 47.9080656134352 17.7101565911347
9 nonveg 7.96755640452006 4.56625351682905", header= TRUE)
Could someone help me write code to recalculate the environmental covariate values in 'covs' using the covariate-specific values reported in 'correction'? E.g. for each value in the column 'elev' of the df 'covs', I need to subtract the 'center_values' entry reported for 'elev' in the 'correction' df, and then divide by the 'scale_values' entry for 'elev' reported in the 'correction' df. Thank you for your kind help.
You may assign var_name to row names, then loop over the names of covs to do the calculations in an sapply.
rownames(correction) <- correction$var_name
res <- as.data.frame(sapply(names(covs), function(x, y)
(covs[, x] - correction[x, "center_values"])/correction[x, "scale_values"]))
res
# X elev builtup river grip pa npp treecov
# 1 -1.725433 -0.5714280 -0.27324970 -0.09586213 -0.6491124 0.66015733 2.173339 0.6958541
# 2 -1.716607 -0.5546776 0.05083296 0.87651254 -0.6590217 -0.08275811 1.911239 0.8174114
# 3 -1.707781 -0.5528462 -0.13867495 0.73253905 -0.7001703 0.35730857 1.611340 -0.1058927
# 4 -1.698956 -0.5405596 0.37928543 -0.29871910 -0.7036568 0.41059457 1.770295 0.2921174
# 5 -1.690130 -0.5251972 0.19353224 0.05755748 -0.7026950 -0.04738713 1.093183 -0.6885470
Check e.g. "elev":
(covs[,"elev"] - correction["elev", "center_values"]) / correction["elev", "scale_values"]
# [1] -0.5714280 -0.5546776 -0.5528462 -0.5405596 -0.5251972
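As an alternative sketch, base R's scale() accepts one center and one scale value per column, so the same correction can be done without an explicit loop. The data below is a cut-down copy of the question's tables (two covariates, three sites), and idx is an illustrative helper name.

```r
# Two covariates and three sites taken from the question
covs <- data.frame(elev    = c(24.379101, 29.533478, 30.097059),
                   builtup = c(25188.572, 32821.770, 28358.244))
correction <- data.frame(var_name      = c("elev", "builtup"),
                         center_values = c(200.217889868483, 31624.4888660664),
                         scale_values  = c(307.718211316278, 23553.2438790344))

# Line up the correction rows with the column order of covs
idx <- match(names(covs), correction$var_name)

# Subtract the per-column center, then divide by the per-column scale
res <- as.data.frame(scale(covs,
                           center = correction$center_values[idx],
                           scale  = correction$scale_values[idx]))
res$elev[1]   # about -0.5714, in line with the "elev" check above
```

Using match() means 'correction' may contain extra rows (such as nonveg) or be in a different order than the columns of 'covs'.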

create a new column conditional on distance traveled between points in R

I am trying to create a new column conditional on another column, a bit like a moving average or moving window but based on distance between points. Take for example row 2 with a CO2 of 399.935. I would like to have the mean of all the points within 100 m (traveled) of that point. In my example (looking at column CumDist), rows 1, 3, 4, 5 would be selected to calculate the mean. The column CumDist (*100,000 to have the units in meters) consists of cumulative distance traveled. I have 5000 points and obviously the width (or the number of rows) of the moving window will vary.
I tested over() from the sp package, but it's problematic if the same road is taken more than once. I looked on the web for other solutions and I did not find anything that could help me.
dput(DF)
structure(list(CO2 = c(399.9350305, 399.9350305, 399.9350305,
400.0320031, 400.0320031, 400.0320031, 399.7718229, 399.7718229,
399.7718229, 399.3855075, 399.3855075, 399.3855075, 399.4708139,
399.4708139, 399.4708139, 400.0362474, 400.0362474, 400.0362474,
399.7556753, 399.7556753), lon = c(-103.7093538, -103.709352,
-103.7093492, -103.7093467, -103.7093455, -103.7093465, -103.7093482,
-103.7093596, -103.7094074, -103.7094625, -103.7094966, -103.709593,
-103.709649, -103.7096717, -103.7097349, -103.7097795, -103.709827,
-103.7099007, -103.709924, -103.7099887), lat = c(49.46972027,
49.46972153, 49.46971675, 49.46971533, 49.46971307, 49.4697124,
49.46970636, 49.46968214, 49.46960921, 49.46955984, 49.46953621,
49.46945809, 49.46938994, 49.46935281, 49.46924309, 49.46918635,
49.46914762, 49.46912566, 49.46912407, 49.46913321),distDiff = c(0.000342016147509882,
0.000191466419697602, 0.000569046320857002, 0.000240367540492089,
0.000265977754839834, 0.000103953049523505, 0.000682968856240796,
0.0028176007969857, 0.00882013898948418, 0.00678966015562509,
0.00360774024245839, 0.011149423290729, 0.00859796340323456,
0.00444526066124642, 0.0130344010874029, 0.00709037369666853,
0.00551435348701512, 0.00587377717110946, 0.00169806309901329,
0.00479849401022625), CumDist = c(0.000342016147509882, 0.000533482567207484,
0.00110252888806449, 0.00134289642855657, 0.00160887418339641,
0.00171282723291991, 0.00239579608916071, 0.00521339688614641,
0.0140335358756306, 0.0208231960312557, 0.0244309362737141, 0.0355803595644431,
0.0441783229676777, 0.0486235836289241, 0.0616579847163269, 0.0687483584129955,
0.0742627119000106, 0.08013648907112, 0.0818345521701333, 0.0866330461803596
)), .Names = c("X12CO2_dry", "coords.x1", "coords.x2", "V1",
"CumDist"), row.names = 2:21, class = "data.frame")
thanks, Martin
Man you beat me to it with a cleaner solution mra68.
Here's mine using a few loops.
####################
for (j in 1:nrow(DF)){#Loop through all rows of your dataset
CO2list<-NULL ##Need to make a variable before storing to it in the loop
for(i in 1:nrow(DF)){##Loop through all distances in the table
if ((abs(DF$CumDist[i]-DF$CumDist[j]))<=0.001) {
##Check to see if difference in CumDist<=100/100000 for all entries
#CumDist[j] is point with the 100 meter window around it
CO2list<-c(CO2list,DF$X12CO2_dry[i])
##Store your CO2 entries that are within the 100 meter window to a vector
}
}
DF$CO2AVG[j]<-mean(CO2list)
#Get the mean of your list and store it to column named CO2AVG
}
The window that belongs to the i-th row starts at n[i] and ends at m[i]-1. Hence the sum of the CO2-values in the i-th window is CumCO2[m[i]]-CumCO2[n[i]]. (Notice that the indices in CumCO2 are shifted by 1, because of the leading 0.) Dividing this CO2-sum by the window size m[i]-n[i] gives the values meanCO2 for the new column:
n <- sapply( df$CumDist,
function(x){
which.max( df$CumDist >= x-0.001 )
}
)
m <- sapply( df$CumDist,
function(x){
which.max( c(df$CumDist,Inf) > x+0.001 )
}
)
CumCO2 <- c( 0, cumsum(df$X12CO2) )
meanCO2 <- ( CumCO2[m] - CumCO2[n] ) / (m-n)
.
> n
[1] 1 1 1 2 3 3 5 8 9 10 11 12 13 14 15 16 17 18 19 20
> m
[1] 4 5 7 7 8 8 8 9 10 11 12 13 14 15 16 17 18 19 20 21
> meanCO2
[1] 399.9350 399.9593 399.9835 399.9932 399.9606 399.9606 399.9453 399.7718 399.7718 399.3855 399.3855 399.3855 399.4708 399.4708 399.4708 400.0362
[17] 400.0362 400.0362 399.7557 399.7557
>
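To see why the cumulative-sum trick gives the window means, here is a small check on made-up data (dist, val and the half-width w are hypothetical; dist is assumed to be sorted in increasing order, as CumDist is). It computes the means both ways and compares them.

```r
# Toy data: sorted cumulative distance and the values to average
dist <- c(0.0, 0.4, 0.9, 2.0, 2.5, 5.0)
val  <- c(10, 20, 30, 40, 50, 60)
w    <- 1   # half-width of the window

# n: first index inside the window; m: first index past the window
n <- sapply(dist, function(x) which.max(dist >= x - w))
m <- sapply(dist, function(x) which.max(c(dist, Inf) > x + w))

# Prefix sums with a leading 0, so prefix[k] is the sum of the first k - 1 values
prefix <- c(0, cumsum(val))
window_mean <- (prefix[m] - prefix[n]) / (m - n)

# Brute-force version: average every value within w of each point
brute <- sapply(dist, function(x) mean(val[abs(dist - x) <= w]))

all.equal(window_mean, brute)
```

The prefix-sum version only needs the window boundaries per row, while the brute-force version rescans the whole vector for every row, which matters more as the number of points grows.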

R and appending to data frames

I have some cross correlation function crosscor, and I would like to loop through the function for each of the columns I have in my data matrix. The function outputs some cross correlation that looks something like this each time it is run:
Lags Cross.Correlation P.value
1 0 -0.0006844958 0.993233547
2 1 0.1021006478 0.204691627
3 2 0.0976746274 0.226628526
4 3 0.1150337867 0.155426784
5 4 0.1943150900 0.016092041
6 5 0.2360415470 0.003416147
7 6 0.1855274375 0.022566685
8 7 0.0800646242 0.330081900
9 8 0.1111071269 0.177338885
10 9 0.0689602574 0.404948252
11 10 -0.0097332533 0.906856279
12 11 0.0146241719 0.860926388
13 12 0.0862549791 0.302268025
14 13 0.1283308019 0.125302070
15 14 0.0909537922 0.279988895
16 15 0.0628012627 0.457795228
17 16 0.1669241304 0.047886605
18 17 0.2019811994 0.016703619
19 18 0.1440124960 0.090764520
20 19 0.1104842808 0.197035340
21 20 0.1247428178 0.146396407
I would like to put all of the results together in one data frame, and ultimately export it to a csv file so that the columns are as follows: lags.3, cross-correlation.3, p-value.3, lags.4, cross-correlation.4, ... etc. until p.value.50.
I have tried to use do.call as follows, but have not been successful:
for(i in 3:50)
{
l1<-crosscor(data[,2], data[,i], lagmax=20)
ccdata<-do.call(rbind, l1)
cat("Data row", i)
}
I've also tried just creating the data frame straight out, but am just getting the lag column names:
ccdata <- data.frame()
for(i in 3:50)
{
ccdata[i-2:i+1]<-crosscor(data[,2], data[,i], lagmax=20)
cat("Data row", i)
}
What am I doing wrong? Or is there an online source on data sets I could access to figure out how to do this? Best,
There is a transpose method for data.frames. If "crosscor" is the name of the object just try this:
tcrosscor <- t(crosscor)
write.csv(tcrosscor, file="my_crosscor_1.csv")
The first row would be the Lag's; the second row, the Cross.Correlation's; the third row the P.value's. I suppose you could "flatten" it further so it would be entirely "horizontal" or "wide". Seems painful but this might go something like:
single_line <- as.data.frame(unlist(tcrosscor))
names(single_line) <- paste(c("Lag", "Cross.Correlation", "P.value"), rep(1:50, 3), sep=".")
write.csv(single_line, file="my_single_1.csv")
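Another way to get the side-by-side layout described in the question is to collect one result per column in a list and bind them column-wise at the end. This is only a sketch: crosscor_stub is a hypothetical stand-in for the asker's crosscor() (it returns the same three columns but no real p-values), and the data is random toy data with 5 columns instead of 50.

```r
# Hypothetical stand-in for crosscor(); only the shape of the output matters here
crosscor_stub <- function(x, y, lagmax = 20) {
  n <- length(x)
  cc <- sapply(0:lagmax, function(k) cor(x[1:(n - k)], y[(1 + k):n]))
  data.frame(Lags = 0:lagmax, Cross.Correlation = cc, P.value = NA_real_)
}

set.seed(1)
data <- as.data.frame(matrix(rnorm(60 * 5), nrow = 60))  # toy data

cols <- 3:5
results <- lapply(cols, function(i) {
  res <- crosscor_stub(data[, 2], data[, i], lagmax = 20)
  names(res) <- paste(names(res), i, sep = ".")  # e.g. Lags.3, P.value.3
  res
})

# Bind the per-column results side by side and export
ccdata <- do.call(cbind, results)
write.csv(ccdata, file = file.path(tempdir(), "ccdata.csv"), row.names = FALSE)
```

Building the list first and calling do.call(cbind, ...) once avoids overwriting ccdata on every iteration, which is what happens in the first loop of the question.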

Resources