Calculating values between two CSVs in R

I have two data tables (CSV files) which contain information about a MOOC course.
The first table contains information about mouse movements (distance), like this:
1-2163.058../2-20903.66351.../3-25428.5415..
The first number is the day on which it happened (1 = first day, 2 = second day, etc.) and the second number is the distance in pixels (2163.058, 20903.66351, etc.).
The second table contains the same information, except that instead of distance it records time, like this:
1-4662.0/2-43738.0/3-248349.0....
Again the first number is the day on which it happened, and the second number is the time in milliseconds.
In both tables, every column holds the data for a specific web page, and every row records a user's behaviour on that page.
I want to create a new table with the same layout, holding the speed in pixels per millisecond: divide the distance table by the time table, entry by entry, which gives a new table with the same order and shape.
Here are two links for the two tables: goo.gl/AVQW7D goo.gl/zqzgaQ
How can I do this with the raw CSV files?
> dput(distancestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("1-0", "1-42522.28760403924"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id", "http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", "data.frame"))
> dput(timestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("0", "1-189044.0"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id",
"http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", "data.frame"))

This may not be the most efficient method, but I believe it should yield the result you are looking for.
# Set file paths (edit these to point at your files)
dist.file <- "C:/Path/To/Distance/File.csv"
time.file <- "C:/Path/To/Time/File.csv"
# Read data files
dist <- read.csv(dist.file, stringsAsFactors = FALSE)
time <- read.csv(time.file, stringsAsFactors = FALSE)
# Create data frame for speed values
speed <- dist
speed[, 2:ncol(speed)] <- NA
# Create progress bar
pb <- txtProgressBar(min = 0, max = ncol(dist) * nrow(dist), initial = 0, style = 3, width = 20)
item <- 0
# Loop through all columns and rows of the distance data
for(col in 2:ncol(dist)){
  for(r in 1:nrow(dist)){
    # Check that the current item has data to be calculated
    if(dist[r, col] != 0 & dist[r, col] != "1-0" & !is.na(time[r, col])){
      # Split the data into its separate day values
      dists <- lapply(strsplit(strsplit(dist[r, col], "/")[[1]], "-"), as.numeric)
      times <- lapply(strsplit(strsplit(time[r, col], "/")[[1]], "-"), as.numeric)
      # Calculate the speed for each day
      speeds <- sapply(dists, "[[", 2) / sapply(times, "[[", 2)
      # Paste the day values back together and assign to the current item in the speed data frame
      speed[r, col] <- paste(sapply(dists, "[[", 1), format(speeds, digits = 20), sep = "-", collapse = "/")
    } else {
      # No data to calculate; assign 0 to the current item in the speed data frame
      speed[r, col] <- 0
    }
    # Increase progress bar counter
    item <- item + 1
    setTxtProgressBar(pb, item)
  }
}
# Create a csv for speed data
write.csv(speed, "speed.csv")
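For wide files the double loop can be slow. Here is a vectorized sketch of the same parsing logic (calc.speed is a helper name of my own, not from any package; it assumes dist and time were read as above):
# Hedged sketch: apply the split/divide logic column-by-column with Map(),
# and cell-by-cell with mapply(), instead of explicit for loops.
calc.speed <- function(d, t) {
  # cells with no usable data stay "0", mirroring the loop's else-branch
  if (is.na(d) || is.na(t) || d == "0" || d == "1-0") return("0")
  dd <- do.call(rbind, lapply(strsplit(strsplit(d, "/")[[1]], "-"), as.numeric))
  tt <- do.call(rbind, lapply(strsplit(strsplit(t, "/")[[1]], "-"), as.numeric))
  paste(dd[, 1], dd[, 2] / tt[, 2], sep = "-", collapse = "/")
}
speed <- dist
speed[-1] <- Map(function(dc, tc) mapply(calc.speed, as.character(dc), as.character(tc)),
                 dist[-1], time[-1])
write.csv(speed, "speed.csv")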

Related

How do I merge two large data.frames and take a select portion of these values?

specdata <- list.files(getwd(), pattern="*.csv")
directory <- lapply(specdata, read.csv)
directory_final <- do.call(rbind, directory)
library(tidyverse)
one <- select(directory_final, nitrate, ID)
two <- na.omit(one)
a <- select(directory_final, sulfate, ID)
b <- na.omit(a)
two_df <- mutate(two, id = rownames(two))
b_df <- mutate(b, id = rownames(b))
library(plyr)
alpha <- join(two_df, b_df, by = "id", match = "all")
alpha$id <- NULL
dput(head(alpha, 5))
structure(list(sulfate = c(7.21, 5.99, 4.68, 3.47, 2.42), ID = c(1L,
1L, 1L, 1L, 1L), nitrate = c(0.651, 0.428, 1.04, 0.363, 0.507
), ID = c(1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 5L), class = "data.frame")
dim(alpha)
[1] 118783      4
Think of it like this: I have two long strings, one 10 m and the other 12 m. One string is red and the other blue. Both strings have knots at 0.05 cm intervals along their entire length. At every 10 knots, I give each individual knot the ID "ID-1" for red and "ID1-1" for blue, and so forth. I have one string in each hand; however, I want these two strings to be one long string, merged side by side, so I tie together the top and end of the strings. Now if I want an individual knot from ID-1, 1/10 of the length of the ID-1 string, I untie the first, and so forth. I want a function that lets me find the mean of every knot I untie, either from ID-1 ranging over 1:332 or ID1-1 ranging over 1:332.
I want something like
alpha_function(nitrate, ID = 1:50)
alpha_function(sulfate, ID = 1:50)
That is, a function that can gather all the mean values of nitrate or sulfate by ID.
Also, when I use the 'join' function, I can only take the mean values of the first data.frame (b_df) that I place in this function, whereas the second always returns NA.
mean(alpha$sulfate)
3.189369
mean(alpha$nitrate)
NA
I would also like to know why this happens and how it can be fixed so that both means can be computed.
The following function might help:
combine.df <- function(df1, df2){
  n <- max(nrow(df1), nrow(df2))
  cbind(df1[1:n,], df2[1:n,])
}
The logic of the function is that R automatically inserts NA when you give it indices which are out of range.
In the event that the dataframes have a differing number of rows, the excess rows will have names like NA, NA.1, NA.2, .... If you don't like that, you could use the following version of the function:
combine.df <- function(df1, df2){
  n <- max(nrow(df1), nrow(df2))
  df <- cbind(df1[1:n,], df2[1:n,])
  row.names(df) <- 1:n
  df
}
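A quick usage sketch. One note of my own: mean() returns NA whenever its input contains any NA, which is why mean(alpha$nitrate) was NA after the join padded non-matching rows; na.rm = TRUE fixes that. alpha_function below is a hypothetical helper matching the question's wish:
alpha <- combine.df(two_df, b_df)
mean(alpha$sulfate, na.rm = TRUE)  # na.rm = TRUE skips the NA-padded rows
mean(alpha$nitrate, na.rm = TRUE)
# mean of one variable by ID, for a chosen range of IDs (hypothetical helper)
alpha_function <- function(df, var, ids) {
  means <- tapply(df[[var]], df[["ID"]], mean, na.rm = TRUE)
  means[as.character(ids)]
}
# alpha_function(directory_final, "nitrate", 1:50)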

R: Select value from a different column for each row

I have a large data frame (cut down to the first 5 rows here) made up of radio-telemetry readings from multiple antennas. Normally there are 10,000+ rows of data like this every couple of weeks.
structure(list(freq.id = c(13, 13, 13, 13, 13), DT = structure(c(1393835337,
1393921137, 1393879437, 1393881387, 1393920987), class = c("POSIXct",
"POSIXt"), tzone = "America/Bogota"), S1 = c(-13624L, -12866L,
-13291L, -13415L, -13002L), N1 = c(-13969L, -13824L, -13868L,
-13881L, -13911L), S2 = c(-14114L, -14026L, -13957L, -13969L,
-14052L), N2 = c(-14211L, -14238L, -14168L, -14148L, -14211L),
S3 = c(-13245L, -13113L, -12801L, -12860L, -13133L), N3 = c(-13816L,
-13832L, -13878L, -14001L, -13706L), S4 = c(-13479L, -12702L,
-12388L, -12501L, -12692L), N4 = c(-13872L, -13820L, -13992L,
-13905L, -13798L), S5 = c(-12516L, -11485L, -10871L, -10900L,
-11452L), N5 = c(-13884L, -13995L, -13804L, -13840L, -13929L
), S6 = c(-12661L, -12168L, -10982L, -11112L, -12164L), N6 = c(-13911L,
-13914L, -13078L, -13778L, -13911L), PW = c(20L, 20L, 20L,
20L, 21L), PI = c(1078L, 1078L, 1080L, 2156L, 1078L), aru.unk = c(2072L,
2058L, 2014L, 2052L, 2047L), msrfreq = c(164421600L, 164421700L,
164421400L, 164421300L, 164421800L), TOWERID = structure(c(1L,
1L, 1L, 1L, 1L), .Label = c("TOWER4", "TOWER5", "TOWER6",
"TOWER7"), class = "factor"), prog.freq = structure(c(9L,
9L, 9L, 9L, 9L), .Label = c("162.7920", "162.9774", "163.0780",
"163.6804", "163.8600", "164.0309", "164.2930", "164.3950",
"164.4220", "164.4350", "164.5040", "164.5430", "164.5620",
"164.7026", "164.7840", "164.8230", "164.8430", "164.9338",
"165.5000"), class = "factor")), .Names = c("freq.id", "DT",
"S1", "N1", "S2", "N2", "S3", "N3", "S4", "N4", "S5", "N5", "S6",
"N6", "PW", "PI", "aru.unk", "msrfreq", "TOWERID", "prog.freq"
), row.names = 40615:40619, class = "data.frame")
Columns S1,S2...S6 are signal values from different antennas and N1,N2...N6 are corresponding noise values
I am trying to pull out the largest and second largest signal values for each row, along with their corresponding noise values. I can get the signal values, as well as their "index" within just the signal columns.
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
mydata$strongest <- apply(mydata[, c(3,5,7,9,11,13)], 1, function(x) x[maxn(1)(x)])
# columns 3,5,7,9,11,13 are the subset of columns containing signal values
mydata$secondstrongest <- apply(mydata[, c(3,5,7,9,11,13)], 1, function(x) x[maxn(2)(x)])
mydata$strongestantenna <- apply(mydata[, c(3,5,7,9,11,13)], 1, maxn(1))
# returns 5 because in the first 5 rows, the strongest signal is from the 5th antenna (S5)
mydata$secondstrongestantenna <- apply(mydata[, c(3,5,7,9,11,13)], 1, maxn(2))
# returns a 6
I'm stuck trying to create 2 new columns that extract the noise values for the antennas that have the 1st and 2nd strongest signals. I was hoping to use the place index (1-6) of each antenna to pull out the correct noise values, like this, but it isn't working: it pulls the correct value, but repeats it as many times as the value of mydata$strongestantenna.
mydata$strongantennanoise <- mydata[c(4,6,8,10,12,14)][mydata$strongestantenna]
# Columns 4, 6, 8, 10, 12, and 14 are the noise values
The strongest and second strongest antennas don't change here, but do in the data, as the animal being tracked moves around.
I feel like I'm overlooking something simple, but I can't figure it out. I appreciate whatever help you can give.
# Get names of the strongest and second strongest antennas by row:
strongest <- apply(mydata[, c(3,5,7,9,11,13)], 1, function(x) names(x[maxn(1)(x)]))
secondstrongest <- apply(mydata[, c(3,5,7,9,11,13)], 1, function(x) names(x[maxn(2)(x)]))
# Get column indices for the associated noise columns (each noise column
# sits directly after its signal column, hence the +1):
biggest.noise.col <- sapply(seq_along(mydata[, 1]),
                            function(x) which(colnames(mydata) == strongest[x]) + 1)
second.biggest.noise.col <- sapply(seq_along(mydata[, 1]),
                                   function(x) which(colnames(mydata) == secondstrongest[x]) + 1)
# Use the indices to extract the relevant noise values:
mydata$strongestantennanoise <- sapply(seq_along(mydata[, 1]),
                                       function(x) mydata[x, biggest.noise.col[x]])
mydata$secondstrongestantennanoise <- sapply(seq_along(mydata[, 1]),
                                             function(x) mydata[x, second.biggest.noise.col[x]])
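The row-wise sapply() lookups can also be vectorized. A base R sketch, reusing the strongest and secondstrongest name vectors from above:
# match() finds each antenna's position once; indexing the noise matrix with a
# (row, column) matrix then returns one value per row.
sig.cols <- c(3, 5, 7, 9, 11, 13)
noise <- as.matrix(mydata[, sig.cols + 1])             # N1..N6, in antenna order
idx1 <- match(strongest, colnames(mydata)[sig.cols])   # antenna number, 1..6
idx2 <- match(secondstrongest, colnames(mydata)[sig.cols])
mydata$strongestantennanoise <- noise[cbind(seq_len(nrow(mydata)), idx1)]
mydata$secondstrongestantennanoise <- noise[cbind(seq_len(nrow(mydata)), idx2)]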
Maybe you can also try:
dat1 <- dat[, grep("S", colnames(dat))]
Strongest <- do.call(`pmax`, dat1)
Strongest
#[1] -12516 -11485 -10871 -10900 -11452
indx1 <- which(dat1 == Strongest, arr.ind = TRUE)
indx11 <- unique(indx1[, 2])
# drop the strongest antenna's column, then take the row maxima again
# (this works here because the strongest signal comes from the same antenna
# in every sample row)
SecondStrongest <- do.call(`pmax`, dat1[, -indx11])
SecondStrongest
#[1] -12661 -12168 -10982 -11112 -12164
indx2 <- which(SecondStrongest == dat1, arr.ind = TRUE)
dat2 <- dat[, grep("N", colnames(dat))]
MatchingNoise <- dat2[indx1]
MatchingSecondNoise <- dat2[indx2]
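Note that unique(indx1[,2]) only works because the strongest signal happens to come from the same antenna in every sample row. When the strongest antenna varies from row to row, as it will in the full data, max.col() avoids that assumption; a sketch, assuming dat is the question's data frame:
sig <- as.matrix(dat[, grep("S", colnames(dat))])
noise <- as.matrix(dat[, grep("N", colnames(dat))])
best <- max.col(sig, ties.method = "first")          # per-row strongest antenna
masked <- sig
masked[cbind(seq_len(nrow(sig)), best)] <- -Inf      # hide it to find the runner-up
second <- max.col(masked, ties.method = "first")
MatchingNoise <- noise[cbind(seq_len(nrow(noise)), best)]
MatchingSecondNoise <- noise[cbind(seq_len(nrow(noise)), second)]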

Using "apply" functions across multiple data frames

I'm having an issue using apply functions (which I assume is the right way to do the following) across multiple data frames.
Some example data (3 different data frames, but the problem I'm working on has upwards of 50):
biz <- data.frame(
  country = c("england","canada","australia","usa"),
  businesses = sample(1000:2500, 4))
pop <- data.frame(
  country = c("england","canada","australia","usa"),
  population = sample(10000:20000, 4))
restaurants <- data.frame(
  country = c("england","canada","australia","usa"),
  restaurants = sample(500:1000, 4))
Here's what I ultimately want to do:
1) Sort each data frame from largest to smallest, according to the variable that's included
dataframe <- dataframe[order(dataframe$VARIABLE, decreasing = TRUE),]
2) Then create a vector variable that gives me the rank for each
dataframe$rank <- 1:nrow(dataframe)
3) Then create another data frame that has one column of the countries and the rank for each of the variables of interest as other columns. Something that would look like (rankings aren't real here):
country.rankings <- structure(list(country = structure(c(5L, 1L, 6L, 2L, 3L, 4L), .Label = c("brazil",
"canada", "england", "france", "ghana", "usa"), class = "factor"),
restaurants = 1:6, businesses = c(4L, 5L, 6L, 3L, 2L, 1L),
population = c(4L, 6L, 3L, 2L, 5L, 1L)), .Names = c("country",
"restaurants", "businesses", "population"), class = "data.frame", row.names = c(NA,
-6L))
So I'm guessing there's a way to put each of these data frames together into a list, something like:
lib <- c(biz, pop, restaurants)
And then do an lapply across that to 1) sort, 2) create the rank variable, and 3) create the matrix or data frame of rankings for each variable (# of businesses, population size, # of restaurants) for each country. The problem I'm running into is that writing the lapply function to sort each data frame runs into issues when I try to order by the variable:
sort <- lapply(lib,
  function(x){
    x <- x[order(x[,2]),]
  })
returns the error message:
Error in `[.default`(x, , 2) : incorrect number of dimensions
because I'm trying to apply column headings to a list. But how else would I tackle this problem when the variable names are different for every data frame (keeping in mind that the country names are consistent)?
(I would also love to know how to do this using plyr.)
Ideally I'd recommend data.table for this.
However, here is a quick solution using data.frames.
Try this:
Step 1: Create a list of all the data.frames
varList <- list(biz, pop, restaurants)
Step 2: Combine all of them into one data.frame
temp <- varList[[1]]
for(i in 2:length(varList)) temp <- merge(temp, varList[[i]], by = "country")
Step 3: Get the ranks:
cbind(temp, apply(temp[,-1], 2, rank))
You can remove the undesired columns if you want:
cbind(temp[,1:2], apply(temp[,-1], 2, rank))[,-2]
Hope this helps!!
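The same three steps in compact form (a sketch). Note that the question's lib <- c(biz, pop, restaurants) flattens the data frames into one long list of columns, which is what broke the lapply; list() keeps each frame whole:
varList <- list(biz, pop, restaurants)
# Reduce() folds merge() across the list, replacing the explicit for loop
temp <- Reduce(function(x, y) merge(x, y, by = "country"), varList)
cbind(temp[1], apply(temp[-1], 2, rank))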
totaldatasets <- c('biz', 'pop', 'restaurants')
totaldatasetslist <- vector(mode = "list", length = length(totaldatasets))
for (i in seq(length(totaldatasets))) {
  totaldatasetslist[[i]] <- get(totaldatasets[i])
}
totaldatasetslist2 <- lapply(
  totaldatasetslist,
  function(x)
  {
    # note: use x here, not totaldatasetslist[[i]]; i is stale after the loop
    temp <- data.frame(
      country = x[, 1],
      countryrank = rank(x[, 2])
    )
    colnames(temp) <- c('country', colnames(x)[2])
    return(temp)
  }
)
Reduce(
  merge,
  totaldatasetslist2
)
Output -
country businesses population restaurants
1 australia 3 3 3
2 canada 2 2 2
3 england 1 1 1
4 usa 4 4 4
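The question also asked about plyr; its successor dplyr (together with purrr) gives a compact equivalent. A sketch, assuming the three example frames from the question:
library(dplyr)
library(purrr)
list(biz, pop, restaurants) |>
  map(~ mutate(.x, across(-country, rank))) |>  # rank each frame's value column
  reduce(inner_join, by = "country")            # merge everything on country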

Returning first row of group

I have a dataframe consisting of an ID (the same for each element in a group), two datetimes, and the time interval between them. One of the datetime objects is my relevant time marker. Now I'd like to get a subset of the dataframe consisting of the earliest entry for each group. The entries (especially the time interval) need to stay untouched.
My first approach was to sort the frame by 1. ID and 2. the relevant datetime. However, I wasn't able to return the first entry for each new group.
I then looked at the aggregate() and ddply() functions, but I could not find an option in either that just returns the first entry without applying an aggregation function to the time-interval value.
Is there an (easy) way to accomplish this?
ADDITION:
Maybe I was unclear in my notes about aggregate() and ddply(): I do not necessarily need to aggregate. Given that the dataframe is sorted so that the first row of each new group is the row I am looking for, it would suffice to return a subset containing each row whose ID differs from the one before (i.e., the start row of each new group).
Example data:
structure(list(ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L,
1727L, 1727L, 1488L), Line = structure(c(2L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = ""), Interval = c(1206.16666666667, 2403.96666666667,
3246.15, 6608.48333333333, 1089.96666666667, 8713.95, 2704.3,
12375.2, 495.85)), .Names = c("ID", "Line", "Start", "End",
"Interval"), row.names = c(NA, -9L), class = "data.frame")
By reproducing the example data frame and testing it I found a way of getting the needed result:
Order data by relevant columns (ID, Start)
ordered_data <- data[order(data$ID, data$Start),]
Find the first row for each new ID
final <- ordered_data[!duplicated(ordered_data$ID),]
As you don't provide any data, here is an example using base R with a sample data frame:
df <- data.frame(group=c("a", "b"), value=1:8)
## Order the data frame with the variable of interest
df <- df[order(df$value),]
## Aggregate
aggregate(df, list(df$group), FUN=head, 1)
EDIT: As Ananda suggests in his comment, the following call to aggregate is better:
aggregate(.~group, df, FUN=head, 1)
If you prefer to use plyr, you can replace aggregate with ddply:
ddply(df, "group", head, 1)
Using ffirst from collapse
library(collapse)
ffirst(df, g = df$group)
data
df <- data.frame(group=c("a", "b"), value=1:8)
This could also be achieved with dplyr, using group_by() and the slice family of functions (arranging by Start first so that the earliest entry per group is returned):
library(dplyr)
data %>%
  group_by(ID) %>%
  arrange(Start) %>%
  slice_head(n = 1)
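A data.table sketch of the same idea, assuming the example frame is called data as in the dplyr snippet:
library(data.table)
setDT(data)                          # convert to data.table in place
data[order(Start), .SD[1], by = ID]  # sort by Start, then first row per ID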

Counting an event only every X days per subject (in an irregular time series)

I've got data where I'm counting episodes of care (like ER visits). The trick is, I can't count every single visit, because sometimes a 2nd or 3rd visit is actually a follow-up for a previous problem. So I've been given direction to count visits using a 30-day "clean period" or "blackout period": I look for the first event (VISIT 1) by patient (min date), count that event, then apply rules so as NOT to count any visits that occur in the 30 days following the first event. After that 30-day window has elapsed, I can begin looking for the 2nd visit (VISIT 2), count that one, then apply the 30-day blackout again (NOT counting any visits that occur in the 30 days after visit #2)... wash, rinse, repeat...
I have rigged together a very sloppy solution that requires a lot of babysitting and manual checking of steps (see below). I have to believe that there is a better way. HELP!
data1 <- structure(list(ID = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L), .Label = c("", "patient1", "patient2",
"patient3"), class = "factor"), Date = structure(c(14610, 14610,
14627, 14680, 14652, 14660, 14725, 15085, 15086, 14642, 14669,
14732, 14747, 14749), class = "Date"), test = c(1L, 1L, 1L, 2L,
1L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L)), .Names = c("ID", "Date",
"test"), class = "data.frame", row.names = c(NA, 14L))
library(doBy)
## create a table of first events
step1 <- summaryBy(Date~ID, data = data1, FUN=min)
step1$Date30 <- step1$Date.min+30
step2 <- merge(data1, step1, by.x="ID", by.y="ID")
## use an ifelse to essentially remove any events that shouldn't be counted
step2$event <- ifelse(as.numeric(step2$Date) >= step2$Date.min & as.numeric(step2$Date) <= step2$Date30, 0, 1)
## basically repeat steps above until I dont capture any more events
## there just has to be a better way
data3 <- step2[step2$event==1,]
data3<- data3[,1:3]
step3 <- summaryBy(Date~ID, data = data3, FUN=min)
step3$Date30 <- step3$Date.min+30
step4 <- merge(data3, step3, by.x="ID", by.y="ID")
step4$event <- ifelse(as.numeric(step4$Date) >= step4$Date.min & as.numeric(step4$Date) <= step4$Date30, 0, 1)
data4 <- step4[step4$event==1,]
data4<- data4[,1:3]
step5 <- summaryBy(Date~ID, data = data4, FUN=min)
step5$Date30 <- step5$Date.min+30
## then I rbind the "keepers"
## in this case steps 1 and 3 above
final <- rbind(step1,step3, step5)
## then reformat
final <- final[,1:2]
final$Date.min <- as.Date(final$Date.min,origin="1970-01-01")
## again, extremely clumsy, but it works... HELP! :)
This solution is loop-free and uses only base R. It produces a logical vector ok which selects the acceptable rows of data1.
ave runs the indicated anonymous function over each patient separately.
We define a state vector consisting of the current date and the start of the period for which no other dates are considered; each date is represented by as.numeric(x), where x is the date. step takes the state vector and the current date and updates the state vector. Reduce runs it over the data; we then keep only the rows for which the period start and the current date are the same and for which the current date is not a duplicate.
step <- function(init, curdate) {
  c(curdate, if (curdate > init[2] + 30) curdate else init[2])
}
ok <- !!ave(as.numeric(data1$Date), paste(data1$ID), FUN = function(d) {
  x <- do.call("rbind", Reduce(step, d, c(-Inf, 0), accumulate = TRUE))
  x[-1, 1] == x[-1, 2] & !duplicated(x[-1, 1])
})
data1[ok, ]
Since that kind of manipulation is not straightforward and is error-prone, I would write a separate function to discard events in the blackout period. The function contains a loop, which basically does what you were doing by hand, until there is nothing left to do.
blackout <- function(dates, period = 30) {
  dates <- sort(dates)
  while (TRUE) {
    # drop the first date that falls within `period` days of its predecessor,
    # then rescan, until no such date remains
    spell <- as.numeric(diff(dates)) <= period
    if (!any(spell)) { return(dates) }
    i <- which(spell)[1] + 1
    dates <- dates[-i]
  }
}
# Tests
stopifnot(
  length(
    blackout(seq.Date(Sys.Date(), Sys.Date() + 50, by = 1))
  ) == 2
)
stopifnot(
  length(
    blackout(seq.Date(Sys.Date(), by = 31, length = 5))
  ) == 5
)
It can be used as follows.
library(plyr)
ddply(data1, "ID", summarize, Date=blackout(Date))
How about
do.call('rbind', lapply(split(data1, factor(data1$ID)), function(x) (x <- x[order(x$Date),])[c(T, diff(x$Date) > 30),]))
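One caveat of my own: diff(x$Date) > 30 compares each visit to the immediately preceding visit rather than to the last counted one, so with visits on days 0, 20 and 40 it keeps only day 0, while the stated rule would also count day 40. A small sketch of the greedy clean-period rule (count_visits is a hypothetical name):
count_visits <- function(dates, period = 30) {
  dates <- sort(dates)
  keep <- logical(length(dates))
  last <- -Inf                        # last counted date, as a number
  for (i in seq_along(dates)) {
    if (as.numeric(dates[i]) > last + period) {
      keep[i] <- TRUE                 # count this visit...
      last <- as.numeric(dates[i])    # ...and open a new 30-day blackout window
    }
  }
  dates[keep]
}
lapply(split(data1$Date, data1$ID), count_visits)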
