R: Select value from a different column for each row

I have a large data frame (cut down to the first 5 rows here) composed of radio-telemetry readings from multiple antennas. Normally there are 10,000+ rows of data like this every couple of weeks.
structure(list(freq.id = c(13, 13, 13, 13, 13), DT = structure(c(1393835337,
1393921137, 1393879437, 1393881387, 1393920987), class = c("POSIXct",
"POSIXt"), tzone = "America/Bogota"), S1 = c(-13624L, -12866L,
-13291L, -13415L, -13002L), N1 = c(-13969L, -13824L, -13868L,
-13881L, -13911L), S2 = c(-14114L, -14026L, -13957L, -13969L,
-14052L), N2 = c(-14211L, -14238L, -14168L, -14148L, -14211L),
S3 = c(-13245L, -13113L, -12801L, -12860L, -13133L), N3 = c(-13816L,
-13832L, -13878L, -14001L, -13706L), S4 = c(-13479L, -12702L,
-12388L, -12501L, -12692L), N4 = c(-13872L, -13820L, -13992L,
-13905L, -13798L), S5 = c(-12516L, -11485L, -10871L, -10900L,
-11452L), N5 = c(-13884L, -13995L, -13804L, -13840L, -13929L
), S6 = c(-12661L, -12168L, -10982L, -11112L, -12164L), N6 = c(-13911L,
-13914L, -13078L, -13778L, -13911L), PW = c(20L, 20L, 20L,
20L, 21L), PI = c(1078L, 1078L, 1080L, 2156L, 1078L), aru.unk = c(2072L,
2058L, 2014L, 2052L, 2047L), msrfreq = c(164421600L, 164421700L,
164421400L, 164421300L, 164421800L), TOWERID = structure(c(1L,
1L, 1L, 1L, 1L), .Label = c("TOWER4", "TOWER5", "TOWER6",
"TOWER7"), class = "factor"), prog.freq = structure(c(9L,
9L, 9L, 9L, 9L), .Label = c("162.7920", "162.9774", "163.0780",
"163.6804", "163.8600", "164.0309", "164.2930", "164.3950",
"164.4220", "164.4350", "164.5040", "164.5430", "164.5620",
"164.7026", "164.7840", "164.8230", "164.8430", "164.9338",
"165.5000"), class = "factor")), .Names = c("freq.id", "DT",
"S1", "N1", "S2", "N2", "S3", "N3", "S4", "N4", "S5", "N5", "S6",
"N6", "PW", "PI", "aru.unk", "msrfreq", "TOWERID", "prog.freq"
), row.names = 40615:40619, class = "data.frame")
Columns S1, S2, ..., S6 are signal values from different antennas, and N1, N2, ..., N6 are the corresponding noise values.
I am trying to pull out the largest and second-largest signal values for each row, along with their corresponding noise values. I can get the signal values, as well as each one's "index" within the subset of signal columns.
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
mydata$strongest <- apply(mydata[, c(3,5,7,9,11,13)], 1, function(x) x[maxn(1)(x)])
# columns 3, 5, 7, 9, 11, and 13 are the subset of columns containing signal values
mydata$secondstrongest <- apply(mydata[, c(3,5,7,9,11,13)], 1, function(x) x[maxn(2)(x)])
mydata$strongestantenna <- apply(mydata[, c(3,5,7,9,11,13)], 1, maxn(1))
# returns 5 because in the first 5 rows, the strongest signal is the 5th antenna (S5)
mydata$secondstrongestantenna <- apply(mydata[, c(3,5,7,9,11,13)], 1, maxn(2))
# returns 6 (the 6th antenna, S6)
I'm stuck trying to create 2 new columns that extract the noise values for the antennas with the 1st and 2nd strongest signals. I was hoping to use the place index (1-6) of each antenna to pull out the correct noise values, like this, but it isn't working. It pulls the correct value, but repeats it the same number of times as the value of mydata$strongestantenna:
mydata$strongestantennanoise <- mydata[c(4,6,8,10,12,14)][mydata$strongestantenna]
# columns 4, 6, 8, 10, 12, and 14 are the noise values
The strongest and second strongest antennas don't change here, but they do in the full data, as the animal being tracked moves around.
I feel like I'm overlooking something simple, but I can't figure it out. I appreciate whatever help you can give.
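For what it's worth, the usual fix for this kind of row-wise lookup is matrix indexing: subscripting a matrix with a two-column matrix of (row, column) pairs returns one value per pair. A minimal sketch, assuming the strongestantenna and secondstrongestantenna columns computed above:
noisecols <- as.matrix(mydata[, c(4, 6, 8, 10, 12, 14)])
# one (row, antenna) pair per row of the data frame
mydata$strongestantennanoise <- noisecols[cbind(seq_len(nrow(mydata)), mydata$strongestantenna)]
mydata$secondstrongestantennanoise <- noisecols[cbind(seq_len(nrow(mydata)), mydata$secondstrongestantenna)]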

# Get names of the strongest and second strongest antennas by row:
strongest <- apply(mydata[, c(3,5,7,9,11,13)], 1, function(x) names(x[maxn(1)(x)]))
secondstrongest <- apply(mydata[, c(3,5,7,9,11,13)], 1, function(x) names(x[maxn(2)(x)]))
# Get column indices for the associated noise columns (each N column sits
# immediately to the right of its S column, hence the + 1):
biggest.noise.col <- sapply(seq_along(mydata[, 1]),
                            function(x) which(colnames(mydata) == strongest[x]) + 1)
second.biggest.noise.col <- sapply(seq_along(mydata[, 1]),
                                   function(x) which(colnames(mydata) == secondstrongest[x]) + 1)
# Use the indices to extract the relevant noise values:
mydata$strongestantennanoise <- sapply(seq_along(mydata[, 1]),
                                       function(x) mydata[x, biggest.noise.col[x]])
mydata$secondstrongestantennanoise <- sapply(seq_along(mydata[, 1]),
                                             function(x) mydata[x, second.biggest.noise.col[x]])

Maybe you can also try:
dat1 <- dat[, grep("S", colnames(dat))]
Strongest <- do.call(`pmax`, dat1)
Strongest
#[1] -12516 -11485 -10871 -10900 -11452
indx1 <- which(dat1 == Strongest, arr.ind = TRUE)
indx11 <- unique(indx1[, 2])
# dropping a single column assumes the strongest antenna is the same
# column in every row, which holds for this sample
SecondStrongest <- do.call(`pmax`, dat1[, -indx11])
SecondStrongest
#[1] -12661 -12168 -10982 -11112 -12164
indx2 <- which(SecondStrongest == dat1, arr.ind = TRUE)
dat2 <- dat[, grep("N", colnames(dat))]
MatchingNoise <- dat2[indx1]
MatchingSecondNoise <- dat2[indx2]

Related

How do I merge two large data.frames and take a select portion of these values?

specdata <- list.files(getwd(), pattern = "*.csv")
directory <- lapply(specdata, read.csv)
directory_final <- do.call(rbind, directory)
library(tidyverse)
one <- select(directory_final, nitrate, ID)
two <- na.omit(one)
a <- select(directory_final, sulfate, ID)
b <- na.omit(a)
two_df <- mutate(two, id = rownames(two))
b_df <- mutate(b, id = rownames(b))
library(plyr)
alpha <- join(two_df, b_df, by = "id", match = "all")
alpha$id <- NULL
dput(head(alpha, 5))
structure(list(sulfate = c(7.21, 5.99, 4.68, 3.47, 2.42), ID = c(1L,
1L, 1L, 1L, 1L), nitrate = c(0.651, 0.428, 1.04, 0.363, 0.507
), ID = c(1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 5L), class = "data.frame")
dim(alpha)
#[1] 118783      4
Think of it like this: I have two long strings, one 10 m and the other 12 m, one red and one blue. Both strings have knots at 0.05 cm intervals along their entire length. At every 10 knots, I give each individual knot the ID-1 for red and ID1-1 for blue, and so forth. I have one string in each hand; however, I want these two strings to be one long string, merged side by side, so I tie them together at the top and at the end. Now if I want an individual knot from ID-1, a tenth of the way along the ID-1 string, I untie the first knot, and so forth. I want a function that lets me find the mean of every knot I untie, either from ID-1 (ranging over 1:332) or from ID1-1 (ranging over 1:332).
I want something like:
alpha_function(nitrate, ID = 1:50)
alpha_function(sulfate, ID = 1:50)
A function that can gather all the mean values of nitrate or sulfate by ID.
Also, when I use the join function, I can only take mean values of the first data.frame (b_df) that I place in this function, whereas the second always returns NA.
mean(alpha$sulfate)
#[1] 3.189369
mean(alpha$nitrate)
#[1] NA
I would also like to know why this happens and how it can be fixed so that both means can be computed.
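(A likely explanation, for what it's worth: the join pads non-matching rows with NA, and mean() returns NA whenever its input contains NA unless told to drop them. A minimal sketch of the requested per-ID means, using base R's aggregate(), which drops NA rows by default; alpha_function above is only hypothetical, so the names here are mine:)
mean(alpha$nitrate, na.rm = TRUE)   # drop the NAs introduced by the join
nitrate_means <- aggregate(nitrate ~ ID, data = alpha, FUN = mean)
sulfate_means <- aggregate(sulfate ~ ID, data = alpha, FUN = mean)
nitrate_means[nitrate_means$ID %in% 1:50, ]   # e.g. IDs 1 to 50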
The following function might help:
combine.df <- function(df1, df2){
  n <- max(nrow(df1), nrow(df2))
  cbind(df1[1:n, ], df2[1:n, ])
}
The logic of the function is that R automatically inserts NA when you give it indices which are out of range.
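A quick illustration of that behaviour:
(1:3)[1:5]
#[1]  1  2  3 NA NA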
In the event that the data frames have differing numbers of rows, the excess rows will have row names like NA, NA.1, NA.2, .... If you don't like that, you could use the following version of the function:
combine.df <- function(df1, df2){
  n <- max(nrow(df1), nrow(df2))
  df <- cbind(df1[1:n, ], df2[1:n, ])
  row.names(df) <- 1:n
  df
}
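With the data above, the call would presumably be:
alpha <- combine.df(two_df, b_df)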

Calculating values between two CSV files in R

I have two data tables (CSV files) which contain information about a MOOC course.
The first table contains information about mouse movements (distance), like this:
1-2163.058../2-20903.66351.../3-25428.5415..
The first number is the day it happened (1 = first day, 2 = second day, etc.), and the second number is the distance in pixels (2163.058, 20903.66351, etc.).
The second table contains the same information, but instead of distance, the time was recorded, like this:
1-4662.0/2-43738.0/3-248349.0....
The first number is the day it happened (1 = first day, 2 = second day, etc.), and the second number is the time in milliseconds.
In each table, every column records data from a specific web page, and every row records a user's behaviour on that page.
I want to create a new table with the same format, where I compute the speed in pixels: divide the distance table by the time table, which gives a new table with the same order and shape.
Here are two links for the two tables goo.gl/AVQW7D goo.gl/zqzgaQ
How can I do this with the raw CSV files?
> dput(distancestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("1-0", "1-42522.28760403924"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id", "http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", 0x0000000002640788))
> dput(timestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("0", "1-189044.0"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id",
"http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", 0x0000000002640788))
This may not be the most efficient method, but I believe it should yield the result you are looking for.
# Set file paths
dist.file <- "C:/Path/To/Distance/File.csv"
time.file <- "C:/Path/To/Time/File.csv"

# Read data files
dist <- read.csv(dist.file, stringsAsFactors = FALSE)
time <- read.csv(time.file, stringsAsFactors = FALSE)

# Create dataframe for speed values
speed <- dist
speed[, 2:ncol(speed)] <- NA

# Create progress bar
pb <- txtProgressBar(min = 0, max = ncol(dist) * nrow(dist), initial = 0, style = 3, width = 20)
item <- 0

# Loop through all columns and rows of distance data
for(col in 2:ncol(dist)){
  for(r in 1:nrow(dist)){
    # Check that current item has data to be calculated
    if(dist[r, col] != 0 & dist[r, col] != "1-0" & !is.na(time[r, col])){
      # Split the data into its separate day values
      dists <- lapply(strsplit(strsplit(dist[r, col], "/")[[1]], "-"), as.numeric)
      times <- lapply(strsplit(strsplit(time[r, col], "/")[[1]], "-"), as.numeric)
      # Calculate the speeds for each day
      speeds <- sapply(dists, "[[", 2) / sapply(times, "[[", 2)
      # Paste together the day values and assign to the current item in speed dataframe
      speed[r, col] <- paste(sapply(dists, "[[", 1), format(speeds, digits = 20), sep = "-", collapse = "/")
    } else{
      # No data to calculate, assign 0 to current item in speed dataframe
      speed[r, col] <- 0
    }
    # Increase progress bar counter
    item <- item + 1
    setTxtProgressBar(pb, item)
  }
}

# Create a csv for speed data
write.csv(speed, "speed.csv")
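A terser sketch of the same per-cell calculation, under the same assumptions about the file layout (calc_speed is a helper introduced here for illustration, not part of the original answer):
calc_speed <- function(d, t) {
  # treat empty cells the way the loop above does
  if (d == "0" || d == "1-0" || is.na(t) || t == "0") return("0")
  dd <- lapply(strsplit(strsplit(d, "/")[[1]], "-"), as.numeric)
  tt <- lapply(strsplit(strsplit(t, "/")[[1]], "-"), as.numeric)
  paste(sapply(dd, "[[", 1), sapply(dd, "[[", 2) / sapply(tt, "[[", 2),
        sep = "-", collapse = "/")
}
for(j in 2:ncol(dist)){
  speed[[j]] <- mapply(calc_speed, as.character(dist[[j]]),
                       as.character(time[[j]]), USE.NAMES = FALSE)
}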

Calculating percent of categorical responses (with grouping) in R

I have the following dataframe:
IV Device1 Device2 Device3
Color Same Same Missing
Color Different Same Missing
Color Same Unique Missing
Shape Same Missing Same
Shape Different Same Different
Explanation: each IV (independent variable) is composed of several measurements (the 'Color' section is composed of 3 different measurements, while 'Shape' is composed of 2).
Each data point has one of 4 possible categorical values: Same/Different/Unique/Missing. 'Missing' means that there is no value for that measurement in the case of that device, while the other 3 values represent the existing result for that measurement.
Question: I want to calculate for each device the percent of times that it has a Same/Different/Unique value (thus generating 3 different percentages), out of the total number of values for that IV (not including cases where there is a ‘Missing’ value).
For example, Device2 would have the following percentages:
Color- 67% same, 0% different, 33% unique.
Shape- 100% same, 0% different, 0% unique.
Thank you!
This is not a tidy solution, but you can use it until someone else posts a better one:
# Replace all "Missing" with NAs
df[df == "Missing"] <- NA

# Create factor levels
df[, -1] <- lapply(df[, -1], function(x) {
  factor(x, levels = c('Same', 'Different', 'Unique'))
})

# Custom function to calculate percent of categorical responses
custom <- function(x) {
  y <- length(na.omit(x))
  if(y > 0)
    return(round((table(x) / y) * 100))
  else
    return(rep(0, 3))
}

library(purrr)
# Split the dataframe on IV, remove the IV column and apply the custom function
Final <- df %>%
  split(df$IV) %>%
  map(function(x) {
    x <- x[, -1]
    t(sapply(x, custom))
  })
Output
Final is a list of two matrices:
$Color
Same Different Unique
Device1 67 33 0
Device2 67 0 33
Device3 0 0 0
$Shape
Same Different Unique
Device1 50 50 0
Device2 100 0 0
Device3 50 50 0
Data
structure(list(IV = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("Color",
"Shape"), class = "factor"), Device1 = structure(c(1L, 2L, 1L,
1L, 2L), .Label = c("Same", "Different", "Unique"), class = "factor"),
Device2 = structure(c(1L, 1L, 3L, NA, 1L), .Label = c("Same",
"Different", "Unique"), class = "factor"), Device3 = structure(c(NA,
NA, NA, 1L, 2L), .Label = c("Same", "Different", "Unique"
), class = "factor")), .Names = c("IV", "Device1", "Device2",
"Device3"), row.names = c(NA, -5L), class = "data.frame")
Quick and dirty: First, replace your 'Missing' values with NA using your preferred method (sed, Excel, etc.), then use table() on each of the columns to get the summary statistics:
myStats <- function(x){
  table(factor(x, levels = c('Same', 'Different', 'Unique'))) / sum(table(x))
}
# apply over the device columns only (drop the IV column):
apply(yourData[, -1], 2, myStats)
This will return the summary of what you want.
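For completeness, a tidyverse sketch of the same calculation (my addition, assuming dplyr and tidyr are available, and using the NA-recoded data from the dput above):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-IV, names_to = "Device", values_to = "Response") %>%
  filter(!is.na(Response)) %>%
  count(IV, Device, Response) %>%
  group_by(IV, Device) %>%
  mutate(pct = round(100 * n / sum(n))) %>%
  select(-n) %>%
  pivot_wider(names_from = Response, values_from = pct, values_fill = 0)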

Unsplitting a data frame by a variable (different length of factors)

I have a data frame (st1) that I split by a factor. I then applied functions (e.g. mean) to the split data by another factor, and hence I can no longer unsplit, because the result has a different length than my original data frame.
To walk you through what I did, here is the code:
NT <- data.table(st1)
NT2 <- split(NT, NT$bin)
NT3 <- data.frame(sapply(NT2, function(x) x[, list(ang = length(unique(thetadeg)), len = length(T), Vm = mean(V)), by = c("A")]))
head of st1:
structure(list(A = c(25L, 25L, 25L, 25L, 25L, 25L), T = 56:61,
X = c(481.07, 487.04, 490.03, 499, 504.97, 507.96), Y = c(256.97,
256.97, 256.97, 256.97, 256.97, 256.97), V = c(4.482, 5.976,
7.47, 4.482, 5.976, 7.47), thetarad = c(0.164031585831919,
0.169139558949956, 0.171661200692621, 0.179083242584008,
0.183907246800473, 0.186289411097781), thetadeg = c(9.39831757286096,
9.69098287432395, 9.83546230358968, 10.2607139792383, 10.537109061132,
10.6735970214433), bin = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("binA", "binB", "binC", "outbin"), class = "factor")), .Names = c("A", "T", "X", "Y", "V", "thetarad",
"thetadeg", "bin"), row.names = c(NA, 6L), class = "data.frame")
I did not include a dput(head) of my NT3 because it would be too long.
I tried unsplit and unlist but was not successful. What I want is to have one data frame again, with bin as a factor.
Any help would be great.
Edit: I would like my data frame to have A, ang, len, Vm, and bin as headers.
It's not altogether clear what your intended output is, but looking at what you have for NT3, this may be more effective:
NT <- data.table(st1, key = "A")
NT[, list(ang = length(unique(thetadeg)),
          len = length(T),
          Vm = mean(V)),
   by = list(A, bin)]
I managed to find what I did wrong, so this now works:
NT <- data.table(st1, key = "bin")
NT2 <- NT[, list(ang = length(unique(thetadeg)), len = length(T), Vm = mean(V)), by = c("A", "bin")]
Apparently I could already do the by statement in data.table, which was also suggested by @Ricardo Saporta. Thank you for that!
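For comparison, the same aggregation as a dplyr sketch (my addition, not from the original thread):
library(dplyr)
st1 %>%
  group_by(A, bin) %>%
  summarise(ang = n_distinct(thetadeg), len = n(), Vm = mean(V))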

Counting an event only every X days per subject (in an irregular time series)

I've got data where I'm counting episodes of care (like ER visits). The trick is, I can't count every single visit, because sometimes a 2nd or 3rd visit is actually a follow-up for a previous problem. So I've been given direction to count visits using a 30-day "clean period" or "blackout period": I look for the first event (VISIT 1) per patient (the min date), count that event, then apply rules so as NOT to count any visits that occur in the 30 days following that first event. After that 30-day window has elapsed, I can begin looking for the 2nd visit (VISIT 2), count that one, then apply the 30-day blackout again (NOT counting any visits that occur in the 30 days after visit #2)... wash, rinse, repeat...
I have rigged together a very sloppy solution that requires a lot of babysitting and manual checking of steps (see below). I have to believe that there is a better way. HELP!
data1 <- structure(list(ID = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L), .Label = c("", "patient1", "patient2",
"patient3"), class = "factor"), Date = structure(c(14610, 14610,
14627, 14680, 14652, 14660, 14725, 15085, 15086, 14642, 14669,
14732, 14747, 14749), class = "Date"), test = c(1L, 1L, 1L, 2L,
1L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L)), .Names = c("ID", "Date",
"test"), class = "data.frame", row.names = c(NA, 14L))
library(doBy)
## create a table of first events
step1 <- summaryBy(Date ~ ID, data = data1, FUN = min)
step1$Date30 <- step1$Date.min + 30
step2 <- merge(data1, step1, by.x = "ID", by.y = "ID")
## use an ifelse to essentially remove any events that shouldn't be counted
step2$event <- ifelse(as.numeric(step2$Date) >= step2$Date.min & as.numeric(step2$Date) <= step2$Date30, 0, 1)
## basically repeat the steps above until I don't capture any more events
## there just has to be a better way
data3 <- step2[step2$event == 1, ]
data3 <- data3[, 1:3]
step3 <- summaryBy(Date ~ ID, data = data3, FUN = min)
step3$Date30 <- step3$Date.min + 30
step4 <- merge(data3, step3, by.x = "ID", by.y = "ID")
step4$event <- ifelse(as.numeric(step4$Date) >= step4$Date.min & as.numeric(step4$Date) <= step4$Date30, 0, 1)
data4 <- step4[step4$event == 1, ]
data4 <- data4[, 1:3]
step5 <- summaryBy(Date ~ ID, data = data4, FUN = min)
step5$Date30 <- step5$Date.min + 30
## then I rbind the "keepers"
## in this case steps 1, 3, and 5 above
final <- rbind(step1, step3, step5)
## then reformat
final <- final[, 1:2]
final$Date.min <- as.Date(final$Date.min, origin = "1970-01-01")
## again, extremely clumsy, but it works... HELP! :)
This solution is loop-free and uses only base R. It produces a logical vector ok which selects the acceptable rows of data1.
ave runs the indicated anonymous function over each patient separately.
We define a state vector consisting of the current date and the start of the current blackout window; each date is represented by as.numeric(x), where x is the date. step takes the state vector and the current date and updates the state vector. Reduce runs it over the data, and then we keep only the dates that equal the start of their blackout window (i.e. dates that opened a new window and should be counted) and are not duplicates.
step <- function(init, curdate) {
  c(curdate, if (curdate > init[2] + 30) curdate else init[2])
}
ok <- !!ave(as.numeric(data1$Date), paste(data1$ID), FUN = function(d) {
  x <- do.call("rbind", Reduce(step, d, c(-Inf, 0), accumulate = TRUE))
  x[-1, 1] == x[-1, 2] & !duplicated(x[-1, 1])
})
data1[ok, ]
Since that kind of manipulation is not straightforward and is error-prone, I would write a separate function to discard events in the blackout period. The function contains a loop, which basically does what you were doing by hand, until there is nothing left to do.
blackout <- function(dates, period = 30) {
  dates <- sort(dates)
  while(TRUE) {
    spell <- as.numeric(diff(dates)) <= period
    if(!any(spell)) { return(dates) }
    i <- which(spell)[1] + 1
    dates <- dates[-i]
  }
}
# Tests
stopifnot(
  length(
    blackout(seq.Date(Sys.Date(), Sys.Date() + 50, by = 1))
  ) == 2
)
stopifnot(
  length(
    blackout(seq.Date(Sys.Date(), by = 31, length = 5))
  ) == 5
)
It can be used as follows.
library(plyr)
ddply(data1, "ID", summarize, Date=blackout(Date))
How about:
do.call('rbind', lapply(split(data1, factor(data1$ID)), function(x) {
  x <- x[order(x$Date), ]
  # keeps a row when it falls more than 30 days after the immediately
  # preceding visit; note this differs from the clean-period rule when
  # short gaps chain (e.g. visits at days 0, 20, 40 keep only day 0 here)
  x[c(TRUE, diff(x$Date) > 30), ]
}))
