I have a daily time series covering 20 years (column 1 holds the dates and the other columns hold different data). One row has been deleted, and I don't know which one.
I want to find that row, insert the corresponding date, and interpolate the other columns for that row.
Is it possible in R?
Thanks
Assuming your date column is of class "Date", here's a way to locate the gap:
# generate sample data
my.df <- data.frame(date=Sys.Date(), other=rnorm(1))
for(i in 2:100) {
my.df[i,] <- list(Sys.Date() + (i-1), rnorm(1))
}
class(my.df$date)
# [1] "Date"
# remove row 71
my.df <- my.df[-71,]
# Iterate to see where there is a gap
for(i in 2:nrow(my.df)) {
if(my.df$date[i] != my.df$date[i-1] + 1) {
cat("missing row:", i)
break
}
}
# missing row: 71
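The loop above only locates the gap. To also insert the date and interpolate the other columns, one possible approach is sketched below. It assumes the deleted row is not the first or last day, uses a small made-up data set, and interpolates linearly with approx; the names full_dates, all_dates, missing_date and filled are only illustrative.

```r
# Hypothetical sample data: 10 consecutive days, with day 7 removed
full_dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 10)
my.df <- data.frame(date = full_dates, other = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
my.df <- my.df[-7, ]

# Build the complete daily sequence and see which date is absent
all_dates <- seq(min(my.df$date), max(my.df$date), by = "day")
missing_date <- all_dates[!all_dates %in% my.df$date]

# Merge against the full sequence so the missing date gets a row of NAs
filled <- merge(data.frame(date = all_dates), my.df, all.x = TRUE)

# Linearly interpolate the NA from its neighbours (approx drops NA pairs)
filled$other <- approx(x = as.numeric(filled$date),
                       y = filled$other,
                       xout = as.numeric(filled$date))$y
```

With several data columns, the approx step would be repeated for each column. If the first or last day were the deleted one, the gap would not show up in all_dates, so this sketch would not catch it.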
I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a data frame and keeps only those rows whose values are at least, say, 10 apart, starting with the first row. Thus it would start with the first row (and store it), then carry on until it finds a row with a value at least 10 higher than the first, store this row, and then start again from this value, looking for the next row that is at least 10 higher.
So far I have an R for loop that successfully finds adjacent rows at least X apart, but it cannot look any further than one row down, nor can it stop once it has found a matching row and start again from there.
Here is the function I have:
# example data frame
df <- data.frame(x=c(1:1000), pos=sort(sample(1:10000, 1000)))
# prep function (this only checks row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
# create new dfs to store output
new.df <- list()
new.df1 <- data.frame()
# iterate through each row of df
for (i in 1:nrow(df)) {
# if the next row's value is at least pos.diff higher than row i's value, keep
# if not ascending, keep
# if first row, keep
if(isTRUE(df$pos[i+1] >= df$pos[i]+pos.diff | df$pos[i+1] < df$pos[i] | i==1 )) {
# add rows that meet conditions to list
new.df[[i]] <- df[i,] }
}
# bind all rows that met conditions
new.df1 <- bind_rows(new.df)
return(new.df1)}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.
It seems I misunderstood the question earlier. Since we don't want the difference between consecutive rows, but the difference from the last row that was kept, you can try:
nrows <- 1
previous_match <- 1
for(i in 2:nrow(df)) {
if(df$pos[i] - df$pos[previous_match] > 10) {
nrows <- c(nrows, i)
previous_match <- i
}
}
and then subset the selected rows:
df[nrows, ]
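As a minimal sketch, the same loop can be wrapped in a function with the threshold as a parameter. The name keep_spaced and the toy data are made up for illustration, and the comparison uses >= on the assumption that "at least 10 apart" is what is wanted (the loop above uses a strict >).

```r
# Keep the first row, then every row whose pos is at least min.diff
# above the pos of the most recently kept row
keep_spaced <- function(df, min.diff) {
  keep <- 1   # indices of rows to keep; the first row is always kept
  last <- 1   # index of the most recently kept row
  for (i in seq_len(nrow(df))[-1]) {
    if (df$pos[i] - df$pos[last] >= min.diff) {
      keep <- c(keep, i)
      last <- i
    }
  }
  df[keep, ]
}

# Hypothetical toy data, sorted ascending by pos
toy <- data.frame(x = 1:7, pos = c(1, 5, 11, 12, 20, 21, 35))
res <- keep_spaced(toy, 10)
res$pos
# [1]  1 11 21 35
```

Note that 20 is skipped because it is only 9 above the last kept value (11), even though it is 19 above the first row.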
Earlier answer
We can use diff to get the difference between consecutive rows and select the rows whose difference from the previous row is greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The leading TRUE selects the first row by default.
In dplyr, we can use lag to get the value from the previous row:
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)
I would like to count occurrences within a data frame that has 100 rows (one per user) and 5 columns: the user ID, all conducted events, and the three events separately.
For each user, I would like columns 3 to 5 to hold the separate counts of the events that are listed together in column 2, quoted and separated by commas (for example c("stroke", "mouseclick1", "mouseclick2")).
My code looks like this:
frame <- data.frame(matrix(ncol = 5, nrow = length(my.data)))
x <-c("user","eventsall","mouseclick1","mouseclick2","stroke")
colnames(frame) <- x
frame$user <- c(1:length(my.data))
frame$eventsall <- as.character(frame$workflow)
frame$mouseclick1 <- ?????
frame$mouseclick2 <- ?????
frame$stroke <- ?????
How can I define the three variables above so that I can count the frequency of each event for each user within the frame?
The first loop below is correct, but the second is wrong; I would repeat it for mouseclick2 and stroke. Is the function str_count correct?
for (i in frame$user) {
if (is.na(my.data[[i]][["scenario1"]]) == TRUE) {
frame$eventsall[i] <- NA
}
else {
frame$eventsall[i] <- list(my.data[[i]][["scenario1"]][["events.all"]])
}
}
for (i in frame$user) {
if (is.na(my.data[[i]][["scenario1"]][["events.all"]]) == TRUE) {
frame$mouseclick1[i] <- NA
}
else {
frame$mouseclick1[i,3] <- str_count(my.data[[i]][["scenario1"]][["events.all", pattern="mouseclick1"]])
}
}
View(frame)
Thanks a lot!
You can split the comma-delimited string using strsplit and then loop through each row of the data.
# Sample data since none was provided
frame <- data.frame(user=c(1:5),
eventsall=c('1,2,3',
'3,4,6',
'5,3,2',
'7,4,5',
'6,6,5'))
frame$eventsall <- as.character(frame$eventsall)
events.split <- strsplit(frame$eventsall,',')
for(i in 1:nrow(frame)){
frame$mouseclick1[i] <- events.split[[i]][1]
frame$mouseclick2[i] <- events.split[[i]][2]
frame$stroke[i] <- events.split[[i]][3]
}
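If the goal is the number of times each event occurs per user, rather than the event at each position, the split list can be counted directly. This is only a sketch with made-up sample data, assuming the event names are exactly "mouseclick1", "mouseclick2" and "stroke" with no surrounding spaces:

```r
# Hypothetical sample data: one comma-separated event string per user
frame <- data.frame(user = 1:3,
                    eventsall = c("stroke,mouseclick1,mouseclick2",
                                  "mouseclick1,mouseclick1,stroke",
                                  "stroke"),
                    stringsAsFactors = FALSE)

# One character vector of events per user
events.split <- strsplit(frame$eventsall, ",")

# For each event name, count how often it appears in each user's vector
for (ev in c("mouseclick1", "mouseclick2", "stroke")) {
  frame[[ev]] <- vapply(events.split, function(x) sum(x == ev), integer(1))
}

frame$mouseclick1
# [1] 1 2 0
```

Comparing whole elements with == avoids partial matches; a pattern-based count on the unsplit string could otherwise match "mouseclick1" inside a longer name such as "mouseclick10".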
I am trying to write a for loop that goes over every row in a data frame but checks only the first column, which holds the tag ID. If the ID does not match, it should move on to the next row, and so on until it finds the value or reaches the end of the data frame.
The matching row should then be printed.
The purpose is just to see how the for loop works and how "slow" it is (I want to compare it to other ways of searching). I am a bit inexperienced in R and in programming in general.
Progress so far/my code
Thus far I have written this code, and the sticking point is how to make the function check the first column and then move on to the next row.
SearchID = function(data,value) {
for(i in 1:nrow(testdata)) {
row <- testdata[i,1]
if("row" == "value") return(row)
#what now?
}
}
This is a reproducible example:
ID=c("ID43","ID23","ID14","ID14")
y=c(23,45,66,76)
k=c("yes","no","yes","no")
testdata= data.frame(ID,y,k)
If I give ID14 as the value, it should return the whole row with ID14:
ID y k
4 ID14 76 no
Here, you first create an object d1 to hold the rows whose ID column matches the value. The for loop goes through each row of the data and checks the condition; if the row matches, rbind appends it to d1. You can also initialize d1 as d1 <- data.frame().
SearchID <- function(data,value) {
d1 <- c()
for(i in 1:nrow(data)) {
row <- data[i,1]
if(row==value){
d1 <- rbind(d1,data[i,])
}
}
d1
}
SearchID(testdata, 'ID14')
# ID y k
#3 ID14 66 yes
#4 ID14 76 no
SearchID(testdata, 'ID43')
# ID y k
#1 ID43 23 yes
SearchID(testdata, "ID86")
#NULL
It's not clear what assumptions about R knowledge can be made in trying to answer this question. For instance, if we could assume that we know how to extract rows based on a vector of row indices, the following would seem more natural to me than binding the rows one-by-one:
SearchID = function(data,value) {
getme <- numeric() # Initialize empty vector
for(i in 1:nrow(data)) { # Start the loop
row <- data[i, 1] # Capture the relevant value
# Compare, and If there's a match,
if (row == value) getme <- c(getme, i) # add loop index to "getme" vector
}
if (length(getme) == 0) NULL # If the vector is still empty, NULL
else data[getme, ] # else return the relevant rows
}
SearchID(testdata, "ID14")
# ID y k
# 3 ID14 66 yes
# 4 ID14 76 no
At the very least, this answer should give you something else to benchmark against :-)
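For the benchmark itself, a vectorized lookup is a natural third candidate. The sketch below is not from the question; SearchIDVec is a made-up name, and it assumes the ID is in the first column, as in the example data.

```r
# Example data from the question
testdata <- data.frame(ID = c("ID43", "ID23", "ID14", "ID14"),
                       y  = c(23, 45, 66, 76),
                       k  = c("yes", "no", "yes", "no"))

# Compare the whole first column at once and subset by the logical vector
SearchIDVec <- function(data, value) {
  data[data[[1]] == value, ]
}

hits <- SearchIDVec(testdata, "ID14")
nrow(hits)
# [1] 2
```

Unlike the loop versions, this returns a zero-row data frame rather than NULL when nothing matches. Wrapping each call in system.time() on a larger data frame is one simple way to compare the approaches.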
My data is 18 rows by 8 columns. It contains both numeric and text data. I want to assign each row an ID number, so that rows with the same values in the first 5 columns share the same ID. For some reason I don't think I am looping properly. Any thoughts?
sampdata<-read.csv("xxx")
sampdata["ID"] <- 0 #ID column
count<-1 #to subtract from 10000
for (p in 1:18) {
if (sampdata[p,9] == 0){
count<-count+1
sampdata[9,p]<-10000-count
for (i in 1:5){ #column index for current check (only check defining info)
for (j in 1:18) { #row index for current check
for (k in 1:18){ #column index for current check against
if (sampdata[i,j]==sampdata[i,k])
sampdata[j,9]<-sampdata[9,p] #assign same ID number
}
}
}
}
}
Assuming your data looks something like this
mm<-matrix(c(
1,1,2,2,3, 1,1,2,2,3,
2,2,2,3,3, 2,2,2,3,3,
4,3,2,1,2, 1,1,2,2,3,
3,1,1,2,2
), byrow=T, ncol=5)
dd<-data.frame(mm[,1:3],
X4=letters[mm[,4]], X5=mm[,5],
matrix(runif(nrow(mm)*(18-ncol(mm))), nrow=nrow(mm)))
Here your data is in dd, and the first 5 columns define a group. You can use interaction() to assign a unique ID to each group like this:
dd$ID <- as.numeric(interaction(dd[,1:5], drop=T, lex.order=T))
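To see what the resulting IDs look like, here is a small sketch on a made-up two-column data frame (the names small, a and b are only for illustration). Rows with identical values in the grouping columns get the same number:

```r
# Hypothetical data: rows 1, 2 and 4 share the same values in a and b
small <- data.frame(a = c(1, 1, 2, 1),
                    b = c("x", "x", "y", "x"))

# interaction() builds one factor from the combinations of the columns;
# drop = TRUE removes combinations that never occur, so IDs are consecutive
small$ID <- as.numeric(interaction(small[, 1:2], drop = TRUE, lex.order = TRUE))

small$ID
# [1] 1 1 2 1
```

The numbers follow the sorted order of the factor levels, not the order in which the groups first appear in the data.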
I've got data being read into a data frame in R, by column. Some of the columns will only ever increase in value; for those columns only, I want to replace each value (n) with its difference from the previous value in that column. For example, looking at an individual column, I want
c(1,2,5,7,8)
to be replaced by
c(1,3,2,1)
which are the differences between successive elements.
However, it's getting really late in the day, and I think my brain has just stopped working. Here's my code at present:
col1 <- c(1,2,3,4,NA,2,3,1) # This column rises and falls, so we want to ignore it
col2 <- c(1,2,3,5,NA,5,6,7) # Note: this column always rises in value, so we want to replace it with deltas
col3 <- c(5,4,6,7,NA,9,3,5) # This column rises and falls, so we want to ignore it
d <- cbind(col1, col2, col3)
d
fix_data <- function(data) {
# Iterate through each column...
for (column in data[,1:dim(data)[2]]) {
lastvalue <- 0
# Now walk through each value in the column,
# checking to see if the column consistently rises in value
for (value in column) {
if (is.na(value) == FALSE) { # Need to ignore NAs
if (value >= lastvalue) {
alwaysIncrementing <- TRUE
} else {
alwaysIncrementing <- FALSE
break
}
}
}
if (alwaysIncrementing) {
print(paste("Column", column, "always increments"))
}
# If a column is always incrementing, alwaysIncrementing will now be TRUE
# In this case, I want to replace each element in the column with the delta between successive
# elements. The size of the column shrinks by 1 in doing this, so just prepend a copy of
# the 1st element to the start of the list to ensure the column length remains the same
if (alwaysIncrementing) {
print(paste("This is an incrementing column:", colnames(column)))
column <- c(column[1], diff(column, lag=1))
}
}
data
}
fix_data(d)
d
If you copy/paste this code into RGui, you'll see that it doesn't do anything to the supplied data frame.
Besides losing my mind, what am I doing wrong??
Thanks in advance
Without addressing the code in detail: you're assigning values to column, which is a local variable within the loop (i.e. there is no relationship between column and data in that context). You need to assign those values to the appropriate column of data.
Also, data is local to your function, so you need to assign the function's return value back to d after calling it (e.g. d <- fix_data(d)).
Incidentally, you can use diff to check whether a column is always incrementing rather than looping over every value:
idx <- apply(d, 2, function(x) !any(diff(x[!is.na(x)]) < 0))
d[,idx] <- blah
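Putting those points together, a corrected version of the function might look like the sketch below. It is one possible rewrite, not the only one: it loops over column indices so it can write back into data, treats a column as incrementing when no difference is negative (matching the >= test in the question), and keeps the first element so the column length is unchanged.

```r
fix_data <- function(data) {
  for (j in seq_len(ncol(data))) {
    deltas <- diff(data[, j])
    # NA differences are ignored when checking the column
    if (all(deltas >= 0, na.rm = TRUE)) {
      # Write into data itself, not into a loop-local copy
      data[, j] <- c(data[1, j], deltas)
    }
  }
  data
}

# Data from the question
d <- cbind(col1 = c(1, 2, 3, 4, NA, 2, 3, 1),
           col2 = c(1, 2, 3, 5, NA, 5, 6, 7),
           col3 = c(5, 4, 6, 7, NA, 9, 3, 5))

# Assign the result back; the function cannot modify d in place
d <- fix_data(d)
d[, "col2"]
# [1]  1  1  1  2 NA NA  1  1
```

The two NA values in col2 arise because both differences involving the original NA are undefined.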
diff calculates the difference between consecutive values in a vector. You can apply it to each column in a dataframe using, e.g.
dfr <- data.frame(x = c(1,2,5,7,8), y = (1:5)^2)
as.data.frame(lapply(dfr, diff))
x y
1 1 3
2 3 5
3 2 7
4 1 9
EDIT: I just noticed a few more things. You are using a matrix, not a data frame (as you stated in the question). For your matrix 'd', you can use
d_diff <- apply(d, 2, diff)
#Find columns that are (strictly) increasing
incr <- apply(d_diff, 2, function(x) all(x > 0, na.rm=TRUE))
#Replace values in the appropriate columns
d[2:nrow(d),incr] <- d_diff[,incr]