Change values of a column to consecutive integers in R

I have a CSV where the first column is my subject ID number.
There are 311 subjects in total, with on average 1000 values per subject.
The subject ID numbers are quite random (although integers only), ranging from 72 to 2265988.
What I would like to do is neatly renumber them 1-311.
What would be the quickest way to do this, preferably in R, or in Excel?

OK, here is a solution; just substitute your own data. Build a lookup table that maps each old ID (sorted) to a new consecutive integer, then use match to translate the column:
df <- data.frame(ID  = c(1, 3, 5, 3, 6, 7),
                 var = c(3, 6, 8, 5, 7, 8))
# Lookup table: one row per unique old ID, numbered consecutively
temp <- data.frame(old_id = sort(unique(df$ID)),
                   new_id = seq_along(unique(df$ID)))
# Translate each ID via its position in the lookup table
replace_ID <- temp$new_id[match(df$ID, temp$old_id)]
df$ID <- replace_ID
df
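For the original question this collapses to one step, since match against the sorted unique IDs yields the new numbering directly (a minimal sketch; the file name subjects.csv and column name subject_id are assumptions):
subjects <- read.csv("subjects.csv")
# Map each of the 311 random IDs to its rank among the sorted unique IDs, i.e. 1-311
subjects$subject_id <- match(subjects$subject_id, sort(unique(subjects$subject_id)))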


Deleting a subset of rows based on other variables

I have followed this example (Remove last N rows in data frame with the arbitrary number of rows), but it deletes only the last 50 rows of the whole data frame rather than the last 50 rows of every study site within it. I have a really big data set with multiple study sites; within each study site there are multiple depths, and for each depth a concentration of nutrients.
I want to delete the last 50 rows of depth for each station.
E.g.
station 1 has 250 depths
station 2 has 1000 depths
station 3 has 150 depths
but keep all the other data consistent.
This just seems to remove the last 50 from the dataframe rather than the last 50 from every station...
df <- df[-seq(nrow(df), nrow(df) - 50), ]
What should I do to add more variables (study site) to filter by?
A potential base R solution would be:
d <- data.frame(station = rep(paste("station", 1:3), c(250, 1000, 150)),
                depth   = rnorm(250 + 1000 + 150, 100, 10))
# Within-group row counter (1, 2, ... for each station; assumes rows are grouped by station)
d$grp_counter <- do.call("c", lapply(tapply(d$depth, d$station, length), seq_len))
# Group size, repeated once per row of the station
d$grp_length <- rep(tapply(d$depth, d$station, length),
                    tapply(d$depth, d$station, length))
# Keep everything except the last 50 rows of each station
d <- d[d$grp_counter <= (d$grp_length - 50), ]
d
# OR w/o auxiliary vars: subset(d, select = -c(grp_counter, grp_length))
We can use the slice function from the dplyr package:
library(dplyr)
df2 <- df %>% group_by(Col1) %>% slice(1:(n() - 4))
It first groups by the category column and then, provided the rows are arranged in the proper order, removes the last n rows (here, 4) from the data frame for each category.
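Applied to the question's data, the same pattern would be (a sketch assuming the grouping column is called station, the rows within each station are already in order, and every station has more than 50 rows):
library(dplyr)
d %>% group_by(station) %>% slice(1:(n() - 50))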

How do I reformat a data set to have this particular structure without for loops?

I am trying to reorganize some raw data into a more condensed form. Currently the data looks like the output of the R code below. I would like the final output to have columns for time, ID, and all possible desired prices, with each ID having only one row per time, holding the quantity at each desired price (i.e., how many an ID wants at a particular price during that time). For example, a particular ID might have a quantity of 1 at 100 and a quantity of 2 at 101. If it is a buy, the value should be negative, and if it is a sell, positive: -1 for a buy at 100, and 2 for a sell at 101.
I originally tried a double for loop, with the outer loop over time and the inner loop over ID. I looked at the quantity column and desired price for an ID, put them into a vector, combined all the vectors together for that time, and repeated. In practice this was not feasible: the code was too slow, as there are hundreds of IDs and thousands of times.
Can someone help me do this in a faster and cleaner way?
set.seed(1)
time <- rep(1:5, each = 15)
id <- sample(342:450, 75, replace = TRUE)
price <- sample(99:103, 75, replace = TRUE)
Desire.Price <- sample(97:105, 75, replace = TRUE)
quantity <- sample(1:4, 75, replace = TRUE)
data <- data.frame(time = time, id = id, price = price,
                   Desire.Price = Desire.Price, quantity = quantity)
data$buysell <- ifelse(data$Desire.Price <= data$price, "BUY", "SELL")
I expect the final data set would look something like this.
Final.df <- data.frame(time = NA, id = NA, "97" = NA, "98" = NA, "99" = NA,
                       "100" = NA, "101" = NA, "102" = NA, "103" = NA,
                       "104" = NA, "105" = NA,
                       check.names = FALSE)  # keep the numeric column names as-is
It would basically condense the original raw data so that all the information for a particular ID in a given time period sits in one row.
Edit: If an ID did not get sampled in a given time (for example, ID 342 is not in time 1), it should have a row of NAs for that time period (so ID 342 would have a row of NAs in time 1). I edited the code that generates the samples to have more IDs, to reflect this (so that they can't all possibly be sampled in every time period).
Here's a tidyverse approach. First, make quantity signed based on BUY/SELL, then sum quantity for each id / time / Desire.Price, and finally spread those sums into wide format with a column for each Desire.Price.
library(dplyr); library(tidyr)
data %>%
  mutate(quantity_signed = if_else(buysell == "BUY", -quantity, quantity)) %>%
  count(id, time, Desire.Price, wt = quantity_signed) %>%
  complete(id, time) %>%   # EDIT: bring in all times for all IDs
  spread(Desire.Price, n) %>%
  View("output")
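In current tidyr (1.0.0 and later), spread is superseded by pivot_wider; the spread step above could equivalently be written as:
pivot_wider(names_from = Desire.Price, values_from = n)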
I think this approach is comparatively simple.
library(reshape2)
# Turn BUY quantity values negative
data[which(data$buysell == "BUY"), ]$quantity <- -(data[which(data$buysell == "BUY"), ]$quantity)
# Use dcast to spread Desire.Price into columns, aggregating quantity with sum
final.df <- dcast(data, time + id ~ Desire.Price, fun.aggregate = sum, value.var = 'quantity')
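One difference from the tidyverse answer above: because dcast aggregates with sum, time/id combinations present in the data but missing a given Desire.Price come out as 0 (the sum of an empty vector) rather than NA, and id/time pairs that never occur in the data get no row at all, since there is no complete step.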

R: returning the 5 rows with the highest values

Sample data
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))
I'm trying to automate the process of returning the rows in a data frame that contain the 5 highest values in a certain column. In the sample data, the 5 highest values in the "kWh" column can be found using the code:
(tail(sort(mysample$kWh), 5))
which in my case returns:
[1] 1.477391 1.765312 1.778396 2.686136 2.710494
I would like to create a table of the rows that contain these numbers in column 2.
I am attempting to use this code:
mysample[mysample$kWh == (tail(sort(mysample$kWh), 5)),]
This returns:
ID kWh
87 87 1.765312
I would like it to return the rows that contain the figures above in the "kWh" column. I'm sure I've missed something basic, but I can't figure it out.
We can use rank:
mysample$Rank <- rank(-mysample$kWh)
head(mysample[order(mysample$Rank), ], 5)
If we don't need to create a column, we can use order directly (as @Jaap mentioned, in three alternative ways):
# order descending and take the first 5 rows
head(mysample[order(-mysample$kWh), ], 5)
# order ascending and take the last 5 rows
tail(mysample[order(mysample$kWh), ], 5)
# or just use a sequence as the row index (note the comma: we index rows, not columns)
mysample[order(-mysample$kWh), ][1:5, ]
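For completeness, a dplyr alternative (slice_max was introduced in dplyr 1.0.0; top_n is the older equivalent):
library(dplyr)
mysample %>% slice_max(kWh, n = 5)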

In R, select rows that have one column that exists in another list

I'm new to R and have a simple stumbling block for which I've been searching for an answer for too long.
The data frame contains a list of individuals with their performance over a five-year period. The analysis needs to include only those individuals that participated in the most recent year, so I need to identify those individuals and then select all of their records from the original data frame, with all columns (there are 50 or more other columns).
The original data frame is performance_fiveyr; the variables I'm working with are person_id and year. I have tried any number of possible ways to get what I need; I'm listing one of them here.
First step is to create the list of individuals that participated this past year:
person_current <- subset(x = performance_fiveyr,
                         subset = year == 2015,  # keep only records from 2015
                         select = person_id)     # keep only the person_id variable
Next step then is to select from performance_fiveyr all rows that have a person_id that exists in person_current, returning all columns (more than 50 in total).
performance_current <- performance_fiveyr[performance_fiveyr$person_id %in% person_current, ]
I've tried more than a few variations of this and end up with either all columns and no rows or all rows and no variables.
Here is some example data:
set.seed(0)
p5 <- data.frame(id = sample(5, 20, replace=TRUE), year = sample(2010:2015, 20, replace=TRUE))
p5 <- p5[order(p5$id, p5$year), ]
I think you were on the right track; the below does what you are after:
current <- unique(p5[p5$year==2015, 'id'])
p_current <- p5[p5$id %in% current, ]
p_current
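The same filter can be expressed with dplyr as a semi join, which keeps the rows of the first table that have a match in the second (a sketch using the example data):
library(dplyr)
p_current <- semi_join(p5, filter(p5, year == 2015), by = "id")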

R find X percent growth within Y time frame

I have an R data frame with 3 columns:
timestamp
Category
Value
I am trying to find an elegant way (ideally) to find where the values increased, or decreased, by X percent within a specified time frame. For example, I'd like to know all points in the data where Value increased by 50% or more within 1 week.
Are there any built-in functions or packages where I can just pass a percentage and a number of days and have it return which rows in the data frame are a match?
Something along these lines (pseudo-code below):
RowsThatareAMatch <- findmatches(date=MyDF$Timestamp, grouping=MyDF$Category, data=MyDF$Value, growth=0.5, range=7)
The thing that is throwing me off is that I want it to return the rows for each Category that has matches, rather than just looking at every value in the data frame. So if categories A and B had growth of 50% or more within 7 days 8 times in my data, I want those rows returned; and if categories C, D, and E never had that kind of increase, I don't want any data from those categories returned at all.
Right now I am looking at systematically splitting the data frame into one data frame per category and then doing the analysis on each individually. While that approach could work, something tells me that R has an easier way to do this.
Thoughts?
Edit: Ideally what I am looking for returned is a data frame with 3 columns and 1 row for every match in my data:
Category
Start timestamp of the match
End timestamp of the match
Based on my experience with R, I would need to identify the row numbers for each grouping and then extract the above data from the original data frame, but if there's a good way to go straight to the above output, that would be awesome too!
Sample Data
So I have a CSV like this:
Timestamp,Category,Value
2015-01-01,A,1
2015-01-02,A,1.2
2015-01-03,A,1.3
2015-01-04,A,8
2015-01-05,A,8.2
2015-01-06,A,9
2015-01-07,A,9.2
2015-01-08,A,10
2015-01-09,A,11
2015-01-01,B,12
2015-01-02,B,12.75
2015-01-03,B,15
2015-01-04,B,60
2015-01-05,B,62.1
2015-01-06,B,63
2015-01-07,B,12.3
2015-01-08,B,10
2015-01-09,B,11
2015-01-01,C,100
2015-01-02,C,100000
2015-01-03,C,200
2015-01-04,C,350
2015-01-05,C,780
2015-01-06,C,780.2
2015-01-07,C,790
2015-01-08,C,790.3
2015-01-09,C,791
2015-01-01,D,0.5
2015-01-02,D,0.8
2015-01-03,D,0.83
2015-01-04,D,2
2015-01-05,D,0.01
2015-01-06,D,0.03
2015-01-07,D,0.99
2015-01-08,D,1.23
2015-01-09,D,5
I would read that into R like this:
df <- read.csv("CategoryMeasurements.csv", header=TRUE)
Say your data.frame is called df. You can do something like this using data.table, which creates a new column reading "increase over 50%" whenever the value grew by 50% or more relative to the previous row (you can then filter on it):
library(data.table)
# Simple lag helper: shift a vector down by n positions, padding the front with NA
lag <- function(x, n) c(rep(NA, n), x[1:(length(x) - n)])
setDT(df)[, flag := ifelse(Value / lag(Value, 1) - 1 > 0.5,
                           "increase over 50%", "Other"), by = Category]
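That compares consecutive rows only. To look across a full 7-day window, a hypothetical extension (a sketch; it works here only because the sample data is daily with no gaps, so 7 rows back = 7 days, and the column name min7 is made up) is to compare each value against the minimum over the previous 7 rows:
setDT(df)[, min7 := do.call(pmin, c(shift(Value, 1:7), na.rm = TRUE)), by = Category]
df[Value / min7 - 1 >= 0.5]  # rows at least 50% above some value in the past week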
Well, I'm not sure how elegant this is, but it works. I wound up having to subset by category before passing the data frame to my function, and I will need to create a loop or use one of the apply functions to pass each category to my function (see the sketch after the output below), but it should get the job done.
Mydf <- read.csv("CategoryMeasurements.csv", header = TRUE)

GetIncreasesWithinRange <- function(df, growth, days) {
  # df     = data frame with the data you want processed;
  #          1st column should be a date, 2nd column should be the data
  # growth = % of growth you are looking for in the data
  # days   = the number of days the growth should occur in to be a match

  df <- df[order(df[, 1]), ]  # Sort the df by the date column; important for the loop logic

  # Initialize an empty data frame to hold results returned from this function
  ReturnDF <- data.frame(StartDate = as.Date(character()),
                         EndDate   = as.Date(character()),
                         Growth    = double(),
                         stringsAsFactors = FALSE)

  TotalRows <- nrow(df)
  for (i in 1:TotalRows) {
    StartDate  <- toString(df[i, 1])
    StartValue <- df[i, 2]
    # Walk forward until we are at least `days` days past the start date
    for (x in i:TotalRows) {
      NextDate <- toString(df[x, 1])
      DayDiff  <- as.numeric(difftime(NextDate, StartDate, units = c("days")))
      if (DayDiff >= days) {
        NextValue <- df[x, 2]
        # Note: change is measured relative to the ending value, not the start
        PercentChange <- (NextValue - StartValue) / NextValue
        if (PercentChange >= growth) {
          ReturnDF[(nrow(ReturnDF) + 1), ] <- list(StartDate, NextDate, PercentChange)
        }
        break
      }
    }
  }
  return(ReturnDF)
}

subDF <- Mydf[which(Mydf$Category == 'A'), ]
subDF$Category <- NULL  # Nuke the category column from the subsetting DF; it's not relevant here
X <- GetIncreasesWithinRange(subDF, 0.5, 4)
print(X)
Which outputs
StartDate EndDate Growth
1 2015-01-01 2015-01-05 0.8780488
2 2015-01-02 2015-01-06 0.8666667
3 2015-01-03 2015-01-07 0.8586957
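As mentioned above, to run this for every category and get the Category / start / end format described in the question, one option (a sketch; the split-apply approach, the res_list name, and the choice to drop empty results are mine) is:
by_cat <- split(Mydf, Mydf$Category)
res_list <- lapply(names(by_cat), function(cat) {
  m <- GetIncreasesWithinRange(by_cat[[cat]][, c("Timestamp", "Value")], 0.5, 4)
  if (nrow(m) == 0) return(NULL)  # do.call(rbind, ...) drops NULLs
  cbind(Category = cat, m)
})
AllMatches <- do.call(rbind, res_list)
AllMatches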
