I am attempting to delete a row like this:
data <- data[-1645,]
However, after running the code, the row is still there. I can tell because there is an outlier in that row that shows up on all my graphs, and when I view the data I can sort a column to easily find the offending value. I have had no trouble deleting rows in the past; has anyone run into anything similar? I understand the limitations of outlier removal and I don't typically remove outliers, but for a number of reasons I would like to see what the data look like without this one (in this case, all other values in the response variable are between -1 and 0, while the value in this row is on the order of 10^4).
You really need to provide more information, but there are several ways you can troubleshoot the problem. The first one is to print out the line you are removing:
data[1645, ]
Is that the outlier? You did not tell us how you identified the outlier. If rows have previously been removed from the data frame, the row names are not renumbered, but the positional index values shift, e.g.
set.seed(42)
x <- sample.int(25)
y <- sample.int(25)
data <- data.frame(x, y)
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 5 4 10
# 6 18 11
data <- data[-c(5, 10, 15, 20, 25), ]
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 6 18 11
# 7 25 15
data[6, ]
# x y
# 7 25 15
data["6", ]
# x y
# 6 18 11
Notice that the 6th row of the data frame has the row name "7", while the row named "6" is the 5th row, because we deleted the 5th row. The which function will give you the positional index, but if you identified the outlier by looking at a printout, you got the row name, and that may be different from the index. If we want to remove rows where x is greater than 24, here is one way to do that:
data[data$x<25, ]
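To see the row-name/index mismatch concretely, here is a sketch that rebuilds the example data above and uses which() to convert a row name into the positional index you can safely delete by:

```r
# rebuild the example data frame and delete every 5th row, as above
set.seed(42)
data <- data.frame(x = sample.int(25), y = sample.int(25))
data <- data[-c(5, 10, 15, 20, 25), ]

# which() turns the row *name* "6" into its positional index (5 here)
idx <- which(rownames(data) == "6")
idx
# [1] 5

data <- data[-idx, ]  # delete by position; the row named "6" is now gone
```
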
After playing around with the data, I think the best explanation is that the indexing is off. This is in line with what dcarlson was saying: it could well be removing the 1,645th row; it just isn't labelled as such. I think the best solution is to use subset:
data <- subset(data, Yield.Decline < 100)
This is more robust than removing a given row by its position: the line can accidentally be run multiple times without erroneously removing additional rows.
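As a quick illustration of that robustness (a toy sketch; the Yield.Decline values here are made up to mimic the situation described, with one huge outlier among values between -1 and 0):

```r
d <- data.frame(Yield.Decline = c(-0.5, -0.2, -0.8, 1e4))
d <- subset(d, Yield.Decline < 100)  # removes the outlier row
d <- subset(d, Yield.Decline < 100)  # running it again removes nothing further
nrow(d)
# [1] 3
```
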
This may be a simple question, but I'm fairly new to R.
What I want to do is perform some kind of addition on the indexes of a list, but once I get past the maximum value it should go back to the first value in that list and start over from there.
for example:
x <-2
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
data[x]
1
data[x+12]
1
data[x+13]
3
or something functionally equivalent. In the end I want to be able to do something like
v=6
x=8
y=9
z=12
values <- c(v,x,y,z)
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
set <- c(data[values[1]],data[values[2]], data[values[3]],data[values[4]])
set
5 7 8 11
values <- values + 8
set
1 3 4 7
I've tried some stuff with addition and subtraction using the length of my list, but it does not work well on the lower numbers.
I hope this was a clear enough explanation,
thanks in advance!
We don't need a loop here, as R vectors accept index vectors of length >= 1:
data[values]
#[1] 5 7 8 11
NOTE: Both objects are vectors, not lists.
If we need to wrap the index around:
values <- values + 8
idx <- ifelse(values > length(data), values - length(data), values)
data[idx]
#[1] 1 3 4 7
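A more general alternative (a sketch) is 1-based modular arithmetic, which also handles indices that overflow the vector length more than once:

```r
# wrap a 1-based index i into the range 1..n
wrap <- function(i, n) ((i - 1) %% n) + 1

data <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
values <- c(6, 8, 9, 12)
data[wrap(values + 8, length(data))]
#[1] 1 3 4 7
data[wrap(values + 32, length(data))]  # wraps around multiple times
#[1] 1 3 4 7
```
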
I am trying to convert the data which I have in txt file:
4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;...
to a column (table) where the values are separated by tab.
4.0945725440979
4.07999897003174
4.0686674118042...
So far I tried
mydata <- read.table("1.txt", header = FALSE)
separate_data<- strsplit(as.character(mydata), ";")
But it does not work. separate_data in this case consists of only 1 element:
[[1]]
[1] "1"
Based on the OP, it's not directly stated whether the raw data file contains multiple observations of a single variable or should be broken into n-tuples. Since the OP does state that read.table() results in a single row where they expect multiple rows, we can conclude that the correct technique is to use scan(), not read.table().
If the data in the raw data file represents a single variable, then the solution posted in comments by @docendo works without additional effort. Otherwise, additional work is required to tidy the data.
Here is an approach using scan() that reads the file into a vector, and breaks it into observations containing 5 variables.
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData),sep=";")
columns <- 5 # set desired # of columns
observations <- length(value) / columns
observation <- unlist(lapply(1:observations,function(x) rep(x,times=columns)))
variable <- rep(1:columns,times=observations)
data.frame(observation,variable,value)
...and the output:
> data.frame(observation,variable,value)
observation variable value
1 1 1 4.094573
2 1 2 4.079999
3 1 3 4.068667
4 1 4 4.059601
5 1 5 4.052183
6 2 1 4.094573
7 2 2 4.079999
8 2 3 4.068667
9 2 4 4.059601
10 2 5 4.052183
>
At this point the data can be converted into a wide format tidy data set with reshape2::dcast().
Note that this solution requires that the number of data values in the raw data file is evenly divisible by the number of variables.
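A simpler alternative sketch for the same reshaping step uses matrix() with byrow = TRUE, assuming (as above) that the vector length is evenly divisible by the column count:

```r
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData), sep = ";")
columns <- 5
# fill row by row: each observation becomes one row of the data frame
wide <- as.data.frame(matrix(value, ncol = columns, byrow = TRUE))
dim(wide)
# [1] 2 5
```
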
I have got the following problem. I have a data.frame with an x and y column representing some points in space:
X<-c(18.25743,18.25783,18.25823,18.25850,18.25863,18.25878,
18.25885,18.25912,18.25943,18.25962,18.25978,18.26000,
18.26022,18.26051,18.26070,18.26095,18.26118,18.26140,
18.26189,18.26250,18.26310,18.26390)
Y<-c(44.69561,44.69564,44.69567,44.69567,44.69586,
44.69600,44.69637,44.69671,44.69691,44.69701,44.69720,
44.69740,44.69763,44.69774,44.69787,44.69790,44.69791,
44.69795,44.69812,44.69802,44.69812,44.69834)
eDF<-data.frame(X,Y)
Now my problem is that they are "sorted" wrongly for plotting. So what I need is a function that writes together the rows of the two points which belong together (in a list of lists):
1 and 12 is ID1
2 and 13 is ID2
3 and 14 is ID3
...
11 and 22 is ID11
Every list so created within the list of lists should have its own unique ID (just numbered from 1 upwards). I ask because I have this problem in all my data sets, which have different lengths.
It would be great if the starting point of the second row selection (12 here) were flexible, always taking the first row after half of the data: (nrow/2) + 1, which in this example is 12.
I have tried some things and I think I'm on the right way, but I can't figure out a solution by myself.
This function is pretty near, but I can't manage to make it start at different rows (1 and 12):
lapply(2:nrow(eDF), function(x) eDF[(x-1):x,])
I also tried to figure it out with seq, and it would do what I need if I could make a list of lists by connecting both code samples. I also need to change the hard-coded start and end numbers to a dynamic solution.
eDF[(seq(1,to=11,by=1)),] # selecting rows 1 to 11
eDF[(seq(12,to=nrow(eDF),by=1)),] #selecting rows 12 to end
Anyone any ideas?
I don't know if you needed an ID column inside of the new list but another way would be:
#create the IDs (1:11 here; use 1:(nrow(eDF)/2) for data of other lengths)
eDF$ID <- rep(1:11, 2)
#split the data.frame according to those
mylist <- split(eDF, eDF$ID)
Output:
mylist
$`1`
X Y ID
1 18.25743 44.69561 1
12 18.26000 44.69740 1
$`2`
X Y ID
2 18.25783 44.69564 2
13 18.26022 44.69763 2
$`3`
X Y ID
3 18.25823 44.69567 3
14 18.26051 44.69774 3
$`4`
X Y ID
4 18.2585 44.69567 4
15 18.2607 44.69787 4
#and so on...
You could just do split(eDF, rep(1:11, 2)) if you don't need the ID column.
We can modify the OP's lapply code
lapply(1:11, function(i) eDF[c(i, i+11),])
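To make this dynamic for data of any even length, replace the hard-coded 11 with half the row count; a small sketch using a 6-row stand-in data frame (the question's 22-row eDF would give 11 pairs):

```r
eDF <- data.frame(X = 1:6, Y = 11:16)  # stand-in; substitute your real eDF
n <- nrow(eDF) / 2                     # rows 1..n pair with rows (n+1)..2n
pairs <- lapply(seq_len(n), function(i) eDF[c(i, i + n), ])
length(pairs)
# [1] 3
```
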
I'm fairly new to R, so I'd like to apologize in advance if I don't choose the best words to explain my issue.
My problem is that I'd like to create a subset out of a dataset (old) which has several columns. So far no problem...
My subset should start when the value (x) in one of the columns reaches its highest point, and stop right after x has dropped down again to its lowest point.
Then create a new dataset (new) with this subset of the data (old).
As there are multiple positions in my original dataset (old) where the value x behaves as described above, I'd like to have a new dataset (new1, new2, new...) for every subset I create.
I hope a was clear in what I'd like to say. If there is more information needed, I'm happy to provide it.
Thanks a lot for your help.
If for instance you have
x <- c(5,4,3,2,1,2,3,4,5,4,3,2,1,2,3,2,1)
Then
direction <- sign(diff(x))
will give a series of +1s and -1s indicating whether x is on an upward or downward swing. We're only interested in downward swings, so let's label upward points with NA, and downward points in the nth swing with the number n:
run <- rle(direction)
run$values[run$values==1] <- NA
run$values[!is.na(run$values)] <- 1:sum(!is.na(run$values))
Now it seems you want to include the last point in a run of downward points (where the sign is positive, as the point after the last point in a downward run is higher). So we need to extend the length of downward runs, and decrease the upward:
run$lengths <- run$lengths + ifelse(is.na(run$values), -1, +1)
swing <- inverse.rle(run)
plot(x, col=swing)
should colour downward runs in different colours, and omit upward runs. You've now got a variable that labels the runs, and you can split your data.frame by
split(myDataFrame, swing)
You might need to check this works if we start/finish on an up or down swing.
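Assembled end to end, the steps above look like this (a sketch using the sample x from the start of this answer, splitting a data frame instead of plotting):

```r
x <- c(5,4,3,2,1,2,3,4,5,4,3,2,1,2,3,2,1)
direction <- sign(diff(x))
run <- rle(direction)
run$values[run$values == 1] <- NA                      # mark upward swings
run$values[!is.na(run$values)] <- 1:sum(!is.na(run$values))
run$lengths <- run$lengths + ifelse(is.na(run$values), -1, +1)
swing <- inverse.rle(run)                              # one label per point
downs <- split(data.frame(x = x), swing)               # NA-labelled rows drop out
length(downs)
# [1] 3
```
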
Here is an option where we check when direction changes with diff, and then split along that. First, make some data:
df <- data.frame(x=rep(c(1:3, 2:1), 3))
Then:
dir.vec <- c(diff(df$x) <= 0, tail(diff(df$x) <= 0, 1)) # has drop started?
split.vec <- cumsum(c(0, diff(dir.vec)) < 0) # which drop # is this?
split(df[dir.vec,,drop=F], split.vec[dir.vec]) # split drops by drop num
Original:
x
1 1
2 2
3 3
4 2
5 1
6 1
7 2
8 3
9 2
10 1
11 1
12 2
13 3
14 2
15 1
Split:
$`0`
x
3 3
4 2
5 1
$`1`
x
8 3
9 2
10 1
$`2`
x
13 3
14 2
15 1
I am trying to modify an R script, but I have only basic experience with R:
question 1:
In the line for (i in 1:nrow(x)), what does the integer 1 actually do? Changing the value to 2 or higher seems to have a big effect on the output.
question 2:
I have been getting the message:
Error in if (p[2] > a + b * p[1]) { :
  missing value where TRUE/FALSE needed
In general, what might be causing this?
Any help is much appreciated!
question edited:
Say I have a dataframe for plotting scatterplot. The dataframe would be organized in the following fashion (in CSV format):
name ABC EFG
1 32 45
2 56 67
up to, say, 200,000 entries
I am going to first do a scatterplot, after which I am going to subset a portion of the dataset into A using alphahull and export it as XYZ. The script for doing this:
#plot first plot containing all data
plot(x = X$ABC,
y = X$EFG,
pch=20,
)
#subset data using ahull. choose 4 points on the plot
A <- ahull(locator(4, type="p", pch=20), alpha=10000)
#exporting subset
XYZ <- {}
for (i in 1:nrow(X)) { if (inahull(A, c(X$ABC[i], X$EFG[i]))) XYZ <- rbind(XYZ, X[i, ]) }
I am getting the following message if the number of data points in the subset that I choose is too large:
Error in if (p[2] > a + b * p[1]) { :
  missing value where TRUE/FALSE needed
Question 1 - this is a for loop: it executes once for each row in the matrix or data frame x (I'm not sure what x is here exactly). The 1 is the start of the sequence 1:nrow(x), so changing it to 2 makes the loop skip the first row. Without the rest of the code I can't say much else.
Question 2 - can you post the whole code? That error means the condition inside if() evaluated to NA rather than TRUE or FALSE, so one or more of the values feeding it is missing.
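For question 2, here is a minimal sketch of how that error arises; p, a, and b are hypothetical stand-ins for the objects in your script:

```r
p <- c(1, NA)        # p[2] is missing
a <- 0; b <- 1
cond <- p[2] > a + b * p[1]
cond
# [1] NA
# if (cond) ... would fail with "missing value where TRUE/FALSE needed"
isTRUE(cond)         # one way to guard: treats NA as FALSE
# [1] FALSE
```
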
Say you have data x
set.seed(123) # for reproducibility
x<-as.data.frame(rnorm(10)) # generate random number and store it as dataframe
k <- 2 # initialize k as 2
for (i in (1:nrow(x))){
cat("this is row",i,"\n")
show (k)
k<-k+i
}
show (k)
this is row 1
[1] 2
this is row 2
[1] 3
this is row 3
[1] 5
this is row 4
[1] 8
this is row 5
[1] 12
this is row 6
[1] 17
this is row 7
[1] 23
this is row 8
[1] 30
this is row 9
[1] 38
this is row 10
[1] 47
> show (k)
[1] 57