I am trying to delete the rows in my dataset that contain NAs, but none of the functions I tried work. What could be the reason?
Here is a sample of my code:
Site_cov<- read.csv("site_cov.csv")
colnames(Site_cov)<- c("Point", "Basal", "Short.Saps", "Tall.Saps")
head(Site_cov)
Point Basal Short.Saps Tall.Saps
1 DEL001 Na 2 0
2 DEL002 Na 1 6
3 DEL003 Na 0 5
4 DEL004 10 21 22
I thought that the mixed upper- and lower-case "Na" values could be the problem, so this is what I ran:
Site_cov$Basal<-toupper(Site_cov$Basal)
Site_cov$Short.Saps<-toupper(Site_cov$Short.Saps)
Site_cov$Tall.Saps<-toupper(Site_cov$Tall.Saps)
Then I tried to delete the NAs:
Site_cov_NA <- Site_cov[complete.cases(Site_cov[ , c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
But the NAs are still there:
head(Site_cov_NA)
Point Basal Short.Saps Tall.Saps
1 DEL001 NA 2 0
2 DEL002 NA 1 6
3 DEL003 NA 0 5
4 DEL004 10 21 22
5 DEL005 60 8 17
6 DEL006 80 17 13
You have 'Na' strings, which are ordinary character values rather than real NAs. Replace them with real NAs, then your code should work.
dat <- replace(dat, dat == 'Na', NA)
dat[complete.cases(dat[, c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
# Point Basal Short.Saps Tall.Saps
# 4 DEL004 10 21 22
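If the file can be re-read, another option is to declare the string as missing at import time. This is a minimal sketch, assuming the missing values are spelled "Na" in the CSV; the inline csv_text is a hypothetical stand-in for site_cov.csv:

```r
# Hypothetical inline CSV standing in for site_cov.csv
csv_text <- "Point,Basal,Short.Saps,Tall.Saps
DEL001,Na,2,0
DEL002,Na,1,6
DEL003,Na,0,5
DEL004,10,21,22"

# na.strings lists every spelling that should be read as a real NA
site_cov <- read.csv(text = csv_text, na.strings = c("NA", "Na"))

# Basal is now numeric with real NAs, so complete.cases() can see them
site_cov_clean <- site_cov[complete.cases(site_cov), ]
print(site_cov_clean)
```

With a real file, the same na.strings argument would go into read.csv("site_cov.csv", ...), and no toupper() step is needed.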
Data:
dat <- structure(list(Point = c("DEL001", "DEL002", "DEL003", "DEL004"
), Basal = c("Na", "Na", "Na", "10"), Short.Saps = c(2L, 1L,
0L, 21L), Tall.Saps = c(0L, 6L, 5L, 22L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Try the complete.cases() function (https://stat.ethz.ch/R-manual/R-patched/library/stats/html/complete.cases.html)
try <- data.frame("a"=c(1,3,NA,NA), "b"=c(3,5,2,3))
try1<-try[complete.cases(try),]
try1
I ran the following Indicator Species Analysis (indval) code from the labdsv package in R on a data frame called "data", where species abundances are columns and sites are rows, as below:
Site Species X Species Y Species Z etc
1 10 3 5
2 5 15 220
3 0 1 0
4 21 100 3
A separate file (which I called spe.grp) holds the corresponding group for each site, either group 1 or group 2, as follows:
Groups
1
2
1
2
I removed categorical variables so that spe.only has only the species data
spe.only <- data[,2:1521]
I then removed species which do not occur in any sample
spe.only[, (!apply(spe.only==0,2,all))]
I then ran Indicator species based on Groups (1) or (2)
(iva <- indval(spe.only, spe.grp$Groups))
But I get
"Error in indval.default(spe.only, spe.grp$Status) : All species
must occur in at least one plot"
How do I resolve this error so that I can run indval correctly?
The step
spe.only[, (!apply(spe.only==0,2,all))]
was not assigned back to the original object. Without the assignment, the result of that step is only printed to the console and spe.only itself is not updated. Assign it back:
spe.only <- spe.only[, (!apply(spe.only==0,2,all))]
Now do the indval
> library(labdsv)
> indval(spe.only, spe.grp$Groups)
$relfrq
1 2
SpeciesX 0.5 1
SpeciesY 1.0 1
SpeciesZ 0.5 1
$relabu
1 2
SpeciesX 0.27777778 0.7222222
SpeciesY 0.03361345 0.9663866
SpeciesZ 0.02192982 0.9780702
$indval
1 2
SpeciesX 0.13888889 0.7222222
SpeciesY 0.03361345 0.9663866
SpeciesZ 0.01096491 0.9780702
$maxcls
SpeciesX SpeciesY SpeciesZ
2 2 2
$indcls
SpeciesX SpeciesY SpeciesZ
0.7222222 0.9663866 0.9780702
$pval
SpeciesX SpeciesY SpeciesZ
0.678 0.319 0.671
The error is reproducible on the original 'spe.only' object
> indval(spe.only, spe.grp$Groups)
Error in indval.default(spe.only, spe.grp$Groups) :
All species must occur in at least one plot
data
spe.only <- structure(list(SpeciesX = c(10L, 5L, 0L, 21L), SpeciesY = c(3L,
15L, 1L, 100L), SpeciesZ = c(5L, 220L, 0L, 3L), SpeciesD = c(0,
0, 0, 0)), row.names = c(NA, -4L), class = "data.frame")
spe.grp <- structure(list(Groups = c(1, 2, 1, 2)),
class = "data.frame", row.names = c(NA,
-4L))
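The effect of the missing assignment can be checked without labdsv. This is a small sketch on the sample data above; colSums(spe.only != 0) > 0 is an equivalent way to write the filter, and the names keep, ncol_before and ncol_after are only illustrative:

```r
spe.only <- data.frame(SpeciesX = c(10L, 5L, 0L, 21L),
                       SpeciesY = c(3L, 15L, 1L, 100L),
                       SpeciesZ = c(5L, 220L, 0L, 3L),
                       SpeciesD = c(0, 0, 0, 0))

# TRUE for species that occur in at least one plot
keep <- colSums(spe.only != 0) > 0

# Without assignment the filtered copy is discarded; spe.only is unchanged
invisible(spe.only[, keep, drop = FALSE])
ncol_before <- ncol(spe.only)   # still includes the all-zero SpeciesD

# With assignment the all-zero column is really removed
spe.only <- spe.only[, keep, drop = FALSE]
ncol_after <- ncol(spe.only)

print(c(before = ncol_before, after = ncol_after))
```

drop = FALSE keeps the result a data frame even if only one species survives the filter.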
Note: This question is a follow-up to a previous question: r - Finding closest coordinates between two large data sets.
I am aiming to identify the nearest entry in dataset 2 to each entry in dataset 1 based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains 4,500 rows (all 4,500 coordinates unique).
The previously referenced post contains a solution to the problem; however, it uses RANN::nn2, which uses Euclidean distance, whereas I want ellipsoidal (Vincenty) distance.
Current code:
df1[ , c(4,5)] <- as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))
df1[,4] <- df2[df1[, 4], 1]
# id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
# 1 1 52.88144 -2.873778 44 0.7990743
# 2 2 57.80945 -2.234544 5688 2.1676868
# 3 4 34.02335 -3.098445 61114 1.4758202
# 4 5 63.80879 -2.439163 23 4.2415854
# 5 6 53.68881 -7.396112 54 3.6445416
# 6 7 63.44628 -5.162345 23 2.3577811
# 7 8 21.60755 -8.633113 440 8.2123762
# 8 9 78.32444 3.813290 76 11.4936496
# 9 10 66.85533 -3.994326 55 1.9296370
# 10 3 51.62354 -8.906553 54 3.2180026
I suspect that the solution would involve geosphere::distVincentyEllipsoid but I am unsure as to how to integrate it into the existing code.
Data:
r details
platform x86_64-w64-mingw32
version.string R version 3.5.3 (2019-03-11)
data set 1 input (not narrowed down to unique coordinates)
df1 <- structure(list(id = c(1L, 2L, 4L, 5L,
6L, 7L, 8L, 9, 10L, 3L),
HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529,
63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207,
78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822,
-2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454,
-5.162345043546359, -8.63311254098095, 3.813289888829932,
-3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")
data set 2 input
df2 <- structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L,
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432,
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA,
10L), class = "data.frame")
Using distVincentyEllipsoid function:
library(geosphere)
t(
apply(
apply(df1[,c(3,2)], 1, function(mrow){distVincentyEllipsoid(mrow, df2[,c(3,2)])}),
2, function(x){ c(SRC_ID=df2[which.min(x),1],distance=min(x))}
)
)
SRC_ID distance
1 44 74680.48
2 5688 238553.51
3 61114 137385.18
4 23 340642.70
5 44 308458.73
6 23 256176.88
7 440 908292.28
8 76 1064419.47
9 55 185119.29
10 54 251580.45
To assign the values to columns 4 and 5 of df1, just use df1[,c(4,5)] <- t(apply(...
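Since the question mentions 180,000 rows but only 1,800 unique coordinates, it may pay to compute distances once per unique coordinate pair and map the results back. The sketch below is not part of the original answer: nearest_by_unique is a hypothetical helper, and haversine_m is a base-R spherical stand-in used only so the example runs without extra packages (it is not Vincenty). It should be possible to pass geosphere::distVincentyEllipsoid as dist_fun instead, since the answer above calls it the same way (one lon/lat point against a table of lon/lat points).

```r
# Stand-in distance: haversine on a sphere, in metres (NOT Vincenty).
# p1 is c(lon, lat); p2 is a two-column matrix of lon, lat in degrees.
haversine_m <- function(p1, p2, r = 6371000) {
  rad <- pi / 180
  dlat <- (p2[, 2] - p1[2]) * rad
  dlon <- (p2[, 1] - p1[1]) * rad
  a <- sin(dlat / 2)^2 +
    cos(p1[2] * rad) * cos(p2[, 2] * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: nearest df2 entry for every row of df1,
# computing distances only once per unique coordinate pair.
nearest_by_unique <- function(df1, df2, dist_fun = haversine_m) {
  coord_cols <- c("HIGH_PRCN_LON", "HIGH_PRCN_LAT")
  uniq <- unique(df1[, coord_cols])
  targets <- as.matrix(df2[, coord_cols])
  res <- t(apply(uniq, 1, function(p) {
    d <- dist_fun(as.numeric(p), targets)
    c(idx = unname(which.min(d)), distance = min(d))
  }))
  uniq$SRC_ID <- df2$SRC_ID[res[, "idx"]]
  uniq$distance <- res[, "distance"]
  # Map the per-coordinate results back onto every row of df1
  key <- function(d) paste(d$HIGH_PRCN_LON, d$HIGH_PRCN_LAT)
  hit <- match(key(df1), key(uniq))
  df1$SRC_ID <- uniq$SRC_ID[hit]
  df1$distance <- uniq$distance[hit]
  df1
}

# Tiny example with a repeated coordinate in df1
df1 <- data.frame(id = 1:3,
                  HIGH_PRCN_LAT = c(52.88144, 57.80945, 52.88144),
                  HIGH_PRCN_LON = c(-2.873778, -2.234544, -2.873778))
df2 <- data.frame(SRC_ID = c(44L, 5688L),
                  HIGH_PRCN_LAT = c(52.29879, 55.67825),
                  HIGH_PRCN_LON = c(-3.42062, -2.63057))
out <- nearest_by_unique(df1, df2)
print(out)
```

The design choice is simply to separate the expensive part (one distance vector per unique point) from the cheap part (a match() back onto the full table), so the number of distance calls scales with the unique coordinates rather than with the row count.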
Using rgeos::gDistance. This computes Cartesian distance, but it was the starting point from which I worked out the updated answer above:
library(sp);library(rgeos)
#convert to spatial datasets
df1rgsp <- SpatialPointsDataFrame(df1[,c(3,2)], df1[,-c(3,2)])
df2rgsp <- SpatialPointsDataFrame(df2[,c(3,2)], data.frame(SRC_ID=df2[,1]))
#apply over each row of the distance matrix
#find the minimum value and the corresponding row number
#transpose so the results become two columns and assign them to columns 4 and 5 of `df1`
df1[,c(4,5)] <- t( apply(gDistance(df1rgsp, df2rgsp, byid=TRUE), 1, function(x){
c(SRC_ID=which.min(x),distance=min(x))}))
#replace row numbers with `SRC_ID`
df1[,4] <- df2[as.integer(df1[, 4]), 1] #same as what you have in the Q
# id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
# 1 1 52.88144 -2.873778 440 1.9296370
# 2 2 57.80945 -2.234544 61114 3.2180026
# 3 4 34.02335 -3.098445 21 2.3577811
# 4 5 63.80879 -2.439163 23 8.8794997
# 5 6 53.68881 -7.396112 55 0.7990743
# 6 7 63.44628 -5.162345 440 3.4316239
# 7 8 21.60755 -8.633113 5688 11.4936496
# 8 9 78.32444 3.813290 54 2.1676868
# 9 10 66.85533 -3.994326 23 6.1545391
# 10 3 51.62354 -8.906553 23 1.4758202
I have data like this
Time chamber
9 1
10 2
11 3
12 4
13 5
14 6
15 7
16 8
17 9
18 10
19 11
20 12
21 1
22 2
23 3
24 4
I want to create a new column using conditions on another existing column (chamber).
It should look something like this
Time chamber treatment
9 1 c2t2
10 2 c2t2
11 3 c0t0r
12 4 c2t2r
13 5 c2t2r
14 6 c0t0
15 7 c0t0r
16 8 c0t0r
17 9 c2t2
18 10 c2t2r
19 11 c0t0
20 12 c0t0
21 1 c2t2
22 2 c2t2
23 3 c0t0r
24 4 c2t2r
For chambers 1,2,9: Treatment is c2t2
For chambers 3,7,8: Treatment is c0t0r.
For chambers 4,5,10: Treatment is c2t2r
For chambers 6,11,12: Treatment is c0t0.
I have also made a lookup table, but I don't know how to use it:
lookup_table <- data.frame(row.names = c("1", "2", "3","4", "5", "6","7", "8", "9","10", "11", "12"),
new_col = c("C2T2", "C2T2", "C0T0R","C2T2R", "C2T2R", "C0T0","C0T0R", "C0T0R", "C2T2","C2T2R", "C0T0", "C0T0"),
stringsAsFactors = FALSE)
Assuming your data frame is named dt, you can use dplyr's case_when inside mutate. Columns can be referred to by bare name there, so dt$ is not needed:
library(tidyverse)
dt %>%
  mutate(newcol = case_when(chamber %in% c(1, 2, 9) ~ "c2t2",
                            chamber %in% c(3, 7, 8) ~ "c0t0r",
                            chamber %in% c(4, 5, 10) ~ "c2t2r",
                            chamber %in% c(6, 11, 12) ~ "c0t0"))
Output:
Time chamber newcol
1 9 1 c2t2
2 10 2 c2t2
3 11 3 c0t0r
4 12 4 c2t2r
5 13 5 c2t2r
6 14 6 c0t0
7 15 7 c0t0r
8 16 8 c0t0r
9 17 9 c2t2
10 18 10 c2t2r
11 19 11 c0t0
12 20 12 c0t0
13 21 1 c2t2
14 22 2 c2t2
15 23 3 c0t0r
16 24 4 c2t2r
You can merge your df with the lookup_table. In my experience, if you want to combine different data.frames, merge() is the command I like to use. Do note that there are many different ways and specialised packages you can use for the same purpose!
You need to specify which column you use as the 'matching column' and also that you want to keep all records in df:
merge(df, lookup_table, all.x = TRUE, by.x = "chamber", by.y = "row.names")
Data:
df <- structure(list(Time = 9:24, chamber = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L)),
.Names = c("Time", "chamber"), class = "data.frame",
row.names = c(NA, -16L))
lookup_table <- structure(list(new_col = c("C2T2", "C2T2", "C0T0R", "C2T2R",
"C2T2R", "C0T0", "C0T0R", "C0T0R",
"C2T2", "C2T2R", "C0T0", "C0T0")),
.Names = "new_col",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")
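The lookup table can also be used directly by matching chamber against its row names. This is a base-R sketch, not taken from the answers above; the column name treatment comes from the question, and unlike merge() (which sorts by the key by default) it leaves the rows of df in their original order:

```r
df <- data.frame(Time = 9:24, chamber = c(1:12, 1:4))
lookup_table <- data.frame(
  row.names = as.character(1:12),
  new_col = c("C2T2", "C2T2", "C0T0R", "C2T2R", "C2T2R", "C0T0",
              "C0T0R", "C0T0R", "C2T2", "C2T2R", "C0T0", "C0T0"),
  stringsAsFactors = FALSE)

# Position of each chamber in the lookup table (NA if a chamber is missing)
pos <- match(as.character(df$chamber), rownames(lookup_table))

# Pull the treatment for every row, keeping the original row order
df$treatment <- lookup_table$new_col[pos]
print(head(df))
```

A chamber that is absent from the lookup table would simply get NA as its treatment.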
I have 2 data frames in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 data frames are matched by a primary key ('pid'). dfnew is a subset of dfold: all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). dfold also has additional variables that I will need in the analysis phase. I would like to merge the 2 data frames into dfmerge so that the common variables are updated from dfnew into dfold while the other pre-existing variables in dfold are retained.
I have tried merge(), match(), dplyr, and sqldf, but I obtain either a dfmerge with only the 75 updated variables (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but fairly long (10 rows) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the data frames:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
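# match() aligns the rows of dfnew to dfold by pid, so the two data frames need not have the same length or row order
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[match(dfold$pid, dfnew$pid)], dfold$Salary)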
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
# match() aligns each dfnew column to the rows of dfold by pid
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[match(dfold$pid, dfnew$pid)], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
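If the goal is to overwrite every common variable with the value from dfnew (not only fill NAs) while keeping the extra variables of dfold, the same match() idea works when dfnew covers only some pids. This is a sketch under that assumption; the data are a small variant of the question's example, and idx, has_new and dfmerge are illustrative names:

```r
dfold <- data.frame(Country = c("France", "Spain", "Germany", "Spain", "Germany"),
                    Age = c(NA, 27, 30, 38, 40),
                    Salary = c(72000, 48000, 54000, 61000, NA),
                    Purchased = c("No", "Yes", "No", "No", "Yes"),
                    pid = 1:5)
dfnew <- data.frame(Age = c(44, 27, 30),
                    Salary = c(72000, 48000, 54000),
                    pid = c(1, 2, 3))

# Variables present in both data frames, excluding the key
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# Row of dfnew that corresponds to each row of dfold (NA if pid is absent)
idx <- match(dfold$pid, dfnew$pid)
has_new <- !is.na(idx)

# Overwrite only the matched rows and common columns; everything else is kept
dfmerge <- dfold
dfmerge[has_new, cols] <- dfnew[idx[has_new], cols]
print(dfmerge)
```

Rows whose pid does not appear in dfnew (pid 4 and 5 here) keep their old values, including any NAs.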
I have looked through other posts and I think I have an idea of what I could do, but I want to be clear!
I have a very large data frame that contains 4 variables and a number of rows.
Chain ResId ResNum Energy
1 C O17 500 -37.03670
2 A ARG 8 -0.84560
3 A LEU 24 -0.56739
4 A ASP 25 -0.98583
5 B ARG 8 -0.64880
6 B LEU 24 -0.58380
7 B ASP 25 -0.85930
Each row contains Chain (A, B, or C), ResId, ResNum, and Energy. I would like to sort this data so that all of the energy values belonging to a specific ResId and ResNum across the chains are clustered together. By cluster I mean that all of the values for "ARG 8" are grouped, or all of the rows containing "ARG 8" are grouped. I don't know which is more efficient. Ideally, I would like the output for all residues to be
ARG 8
0.000
0.000
0.000
where the "0.000" are the energy values for ARG 8 or O17 and so on.
Sorry for the header breaks, I wanted the data to be clean, but I can't insert images.
data
structure(list(Chain = structure(c(3L, 1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), ResId = structure(c(4L,
1L, 3L, 2L, 1L, 3L, 2L), .Label = c("ARG", "ASP", "LEU", "O17"
), class = "factor"), ResNum = c(500L, 8L, 24L, 25L, 8L, 24L,
25L), Energy = c(-37.0367, -0.8456, -0.56739, -0.98583, -0.6488,
-0.5838, -0.8593)), .Names = c("Chain", "ResId", "ResNum", "Energy"
), class = "data.frame", row.names = c(NA, -7L))
If you want to convert to wide format
library(reshape2)
dcast(df, ResId+ResNum~paste0('Energy.',Chain), value.var='Energy')
# ResId ResNum Energy.A Energy.B Energy.C
#1 ARG 8 -0.84560 -0.6488 NA
#2 ASP 25 -0.98583 -0.8593 NA
#3 LEU 24 -0.56739 -0.5838 NA
#4 O17 500 NA NA -37.0367
After your edit, the output you are most likely looking for is:
library(reshape2)
dcast(df, ResId~Chain, value.var= 'Energy')
ResId A B C
1 ARG -0.84560 -0.6488 NA
2 ASP -0.98583 -0.8593 NA
3 LEU -0.56739 -0.5838 NA
4 O17 NA NA -37.0367
This will put the values together. You can further specify based on your desired output.
df[order(df$ResId), ]
Chain ResId ResNum Energy
2 A ARG 8 -0.84560
5 B ARG 8 -0.64880
4 A ASP 25 -0.98583
7 B ASP 25 -0.85930
3 A LEU 24 -0.56739
6 B LEU 24 -0.58380
1 C O17 500 -37.03670
#With dplyr
library(dplyr)
df %>%
arrange(ResId)
Chain ResId ResNum Energy
1 A ARG 8 -0.84560
2 B ARG 8 -0.64880
3 A ASP 25 -0.98583
4 B ASP 25 -0.85930
5 A LEU 24 -0.56739
6 B LEU 24 -0.58380
7 C O17 500 -37.03670
Data
df <- read.table(text = '
Chain ResId ResNum Energy
C O17 500 -37.0367
A ARG 8 -0.8456
A LEU 24 -0.56739
A ASP 25 -0.98583
B ARG 8 -0.6488
B LEU 24 -0.5838
B ASP 25 -0.8593', header = TRUE)
Try this:
df <- df[order(df$Chain, df$ResId, df$ResNum),]
where df is the name of your dataframe. This should order it for you.
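If the aim is literally one block of energy values per residue, as in the "ARG 8" layout shown in the question, split() gives that shape directly. This is a sketch on the sample data, with energy_by_residue as an illustrative name:

```r
df <- read.table(text = "
Chain ResId ResNum Energy
C O17 500 -37.0367
A ARG 8 -0.8456
A LEU 24 -0.56739
A ASP 25 -0.98583
B ARG 8 -0.6488
B LEU 24 -0.5838
B ASP 25 -0.8593", header = TRUE)

# One list element per residue label, holding that residue's energy values
energy_by_residue <- split(df$Energy, paste(df$ResId, df$ResNum))
print(energy_by_residue)
```

Each element can then be pulled out by label, for example energy_by_residue[["ARG 8"]].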