I have a large data frame tocalculate from a survey (original data frame brfss2013), where one of the variables represents the number of times a person checks blood glucose levels. The data is in 3 digits:
The first digit tells you whether the measurements are per day (1), per week (2), per month (3) or per year (4). The second and third digits represent the actual value.
Example: 101 is once ( _01) per day (1 _ _), 202 is twice per week, etc.
I want to standardize everything to a number of times per year, so I will multiply the value in the 2nd and 3rd digits by 365, 52.143, 12 or 1 (days, weeks, months, years).
I think I would be able to "select" the digits to use, but I'm not sure how to write something that applies a different set of instructions to different rows.
EDIT:
Adding my attempt and sample data.
tocalculate <- brfss2013 %>%
  filter(nchar(bldsugar) > 2)

bldsugar2 <- sapply(tocalculate$bldsugar, function(x) {
  if (substr(x, 1, 1) == 1) {x * 365}
  if (substr(x, 1, 1) == 2) {x * 52}
  if (substr(x, 1, 1) == 3) {x * 12}
  if (substr(x, 1, 1) == 4) {x * 365}
})
I'm getting a lot of NULL values though...
Since you're already using dplyr, recode is a handy function. I use %/% to see how many times 100 goes into each bldsugar value and %% to get the remainder when divided by 100. (As an aside, your sapply returns NULLs because an if without an else evaluates to NULL when its condition is false, and only the value of the last expression in the function body is returned.)
# sample data
brfss_sample = data.frame(bldsugar = c(101, 102, 201, 202, 301, 302, 401, 402))
library(dplyr)
mutate(
  brfss_sample,
  mult = recode(
    bldsugar %/% 100,
    `1` = 365.25,
    `2` = 52.143,
    `3` = 12,
    `4` = 1
  ),
  checks_per_year = bldsugar %% 100 * mult
)
# bldsugar mult checks_per_year
# 1 101 365.250 365.250
# 2 102 365.250 730.500
# 3 201 52.143 52.143
# 4 202 52.143 104.286
# 5 301 12.000 12.000
# 6 302 12.000 24.000
# 7 401 1.000 1.000
# 8 402 1.000 2.000
You could, of course, remove the mult column (or combine the definitions so it is never created in the first place).
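Combining them might look like this (a sketch along the same lines as the code above):

mutate(
  brfss_sample,
  checks_per_year = bldsugar %% 100 * recode(
    bldsugar %/% 100,
    `1` = 365.25,
    `2` = 52.143,
    `3` = 12,
    `4` = 1
  )
)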
# Data
set.seed(42)
x = sample(101:499, 100, replace = TRUE)

# 1st digit
as.factor(floor(x / 100))

# Values
((x / 100) %% 1) * 100
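Note that going through division can pick up floating-point noise; integer division and modulo (as in the recode answer above) avoid that. A minimal sketch of the same two steps:

# 1st digit
as.factor(x %/% 100)

# Values
x %% 100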
Perhaps the first thing you can do is split the 3-digit variable into two variables: the first is a single digit showing the sampling frequency, and the second shows the number of measurements.
In R, substr or substring can extract part of a string by specifying the first and last character positions.
# Create example data frame
ex_data <- data.frame(var = c("101", "202", "204"))
# Split the variable to create two new columns
ex_data$var1 <- substring(ex_data$var, first = 1, last = 1)
ex_data$var2 <- substring(ex_data$var, first = 2, last = 3)
# Remove the original variable
ex_data$var <- NULL
After this, you can manipulate your data frame. Perhaps convert var1 to factor and var2 to numeric for further manipulation and analysis.
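For instance, the follow-up step might look like this (a sketch; the mult lookup vector uses the multipliers from the question):

ex_data$var1 <- as.factor(ex_data$var1)
ex_data$var2 <- as.numeric(ex_data$var2)

mult <- c("1" = 365, "2" = 52.143, "3" = 12, "4" = 1)
ex_data$per_year <- ex_data$var2 * mult[as.character(ex_data$var1)]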
I have a dataset that looks like the following:
Call No.   Arrival Time   Call Length (in hrs)   ...
====================================================
1          0.01           0.061
2          0.08           0.05
3          0.10           (Busy/Unanswered)
4          0.15           0.42
...
10         1.03           0.36
11         1.09           0.72
...
I want to count the number of phone calls in each hour (e.g. the number of successful phone calls with arrival times in [0, 1), [1, 2), [2, 3), etc.).
There are some empty values in the call length column, indicating that the phone line was busy, so the call went unanswered. I basically want to count the nonempty occurrences of the call length, grouped by hour. How can I do this using data frame operations in R?
Perhaps this helps
library(dplyr)
df1 %>%
  group_by(Hour = floor(`Arrival Time`)) %>%
  dplyr::summarise(Total_phone_calls =
    sum(complete.cases(`Call Length (in hrs)`)))
Or remove the NA elements in the Call Length (in hrs) column and use n() or count:
library(tidyr)
df1 %>%
  drop_na(`Call Length (in hrs)`) %>%
  count(Hour = floor(`Arrival Time`))
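If the per-hour total call time is wanted as well (the question mentions summing), a sketch in the same style:

df1 %>%
  drop_na(`Call Length (in hrs)`) %>%
  group_by(Hour = floor(`Arrival Time`)) %>%
  summarise(Answered = n(),
            Total_call_time_hrs = sum(`Call Length (in hrs)`))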
Chiming in with a base R solution
dat <- data.frame(`Call No.` = c(1, 2, 3, 4, 10, 11),
                  `Arrival Time` = c(0.01, 0.08, 0.10, 0.15, 1.03, 1.09),
                  `Call Length (in hrs)` = c(0.61, 0.05, NA, 0.42, 0.36, 0.72),
                  check.names = F) # to keep the spaces
# filter out NAs
dat2 <- dat[complete.cases(dat),]
# add an hour variable
dat2$hour <- floor(dat2$`Arrival Time`)
# for fun, create a function that takes in a df
count_and_sum <- function(df){
  return(data.frame(hour = df$hour[1], # assumes we will pass it dfs with 1 hour only
                    answered_calls = nrow(df),
                    total_call_time_hrs = sum(df$`Call Length (in hrs)`)))
}
# use split to separate the data into a list of data.frames by hour
# added a step but might be better to do in one row for memory
splitted <- split(dat2, dat2$hour, drop = TRUE)
# use sapply to apply our function to each element of the splitted list
# and transpose to make the output the right orientation
t(sapply(splitted, count_and_sum))
# hour answered_calls total_call_time_hrs
#0 0 3 1.08
#1 1 2 1.08
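As a side note, t(sapply(...)) gives you a matrix; if you'd rather end up with a plain data.frame, one possibility is:

do.call(rbind, lapply(splitted, count_and_sum))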
I have a data.frame "DF" of 2020 observations and 79066 variables.
The first column is "Year", spanning continuously from 1 to 2020; the other variables hold the values.
In the first instance, I did an average by row in order to have one mean value per year.
E.g.
Aver <- apply(DF[,2:79066], 1, mean, na.rm=TRUE)
However, I would like to do a weighted average and the weight values differ based on columns string values.
The header is "Year" (first column) followed by 79065 columns, where each column name is composed of a number running from 50 to 300, followed by ".R1" through ".R15", and then a number from 10 to 30 with a ".yr" suffix. This gives 251 (50-300) x 15 (R) x 21 (10-30) = 79065 columns.
E.g. : "Year", "50.R1.10.yr", "50.R1.11.yr", "50.R1.12.yr", ... "50.R1.30.yr", "51.R1.10.yr", "51.R1.11.yr", "51.R1.12.yr", ... "51.R1.30.yr", ..."300.R1.10.yr", "300.R1.11.yr", "300.R1.12.yr", ... "300.R1.30.yr", "50.R2.10.yr", "50.R2.11.yr", "50.R2.12.yr", ... "50.R2.30.yr", "51.R2.10.yr", "51.R2.11.yr", "51.R2.12.yr", ... "51.R2.30.yr", ..."300.R2.10.yr", "300.R2.11.yr", "300.R2.12.yr", ... "300.R2.30.yr", ... "50.R15.10.yr", "50.R15.11.yr", "50.R15.12.yr", ... "300.R15.30.yr".
The weight I would like to assign to each column is based on its leading number, 50 to 300: following a power function, columns starting with "50." get the most weight and columns starting with "300." the least.
The equation fitting my values is a power function: y = 2305.2*x^-1.019.
E.g.
av.classes <- data.frame(av=seq(50, 300, 1))
library(dplyr)
av.classes.weight <- av.classes %>% mutate(weight = 2305.2*av^-1.019)
Thank you for any help.
I guess you could get your weight vector like this:
library(tidyverse)
weights_precursor <- str_split(names(DF)[-1], pattern = "\\.", n = 2, simplify = TRUE)[, 1] %>%
  as.numeric()

weights <- 2305.2 * weights_precursor ^ -1.019
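With that vector in hand, the row-wise weighted average could be computed with stats::weighted.mean (a sketch, assuming weights lines up with columns 2 onward):

w_aver <- apply(DF[, -1], 1, weighted.mean, w = weights, na.rm = TRUE)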
Setting up some sample data:
DF <- data.frame(year=2020,`50.R1.10.yr`=1,`300.R15.30.yr`=10)
names(DF) <- stringr::str_remove(names(DF),"X")
Getting numerical vector:
weights <- stringr::str_split(names(DF), "\\.")
weights <- sapply(1:length(weights), function(x) weights[[x]][1])[-1]
weights <- as.numeric(weights)
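The power-function weights from the question then follow directly:

weights <- 2305.2 * weights ^ -1.019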
I want to simulate a time series data frame that contains observations of 5 variables that were taken on 10 individuals. I want the number of rows (observations) to be different between each individual. For instance, I could start with something like this:
ID = rep(c("alp", "bet", "char", "delta", "echo"), times = c(1000,1200,1234,980,1300))
in which case ID represents each unique individual (I would later turn this into a factor), and the number of times each ID was repeated would represent the length of measurements for that factor. I would next need to create a column called Time with sequences from 1:1000, 1:1200, 1:1234, 1:980, and 1:1300 (to represent the length of measurements for each individual). Lastly I would need to generate 5 columns of random numbers for each of the 5 variables.
There are tons of ways to go about generating this data set, but what would be the most practical way to do it?
You can do :
ID = c("alp", "bet", "char", "delta", "echo")
num = c(1000,1200,1234,980,1300)
df <- data.frame(ID = rep(ID, num), num = sequence(num))
df[paste0('rand', seq_along(ID))] <- rnorm(length(ID) * sum(num))
head(df)
# ID num rand1 rand2 rand3 rand4 rand5
#1 alp 1 0.1340386 0.95900538 0.84573154 0.7151784 -0.07921171
#2 alp 2 0.2210195 1.67105483 -1.26068288 0.9171749 -0.09736927
#3 alp 3 1.6408462 0.05601673 -0.35454240 -2.6609228 0.21615254
#4 alp 4 -0.2190504 -0.05198191 -0.07355602 1.1102771 0.88246516
#5 alp 5 0.1680654 -1.75323736 -1.16865142 -0.4849876 0.20559750
#6 alp 6 1.1683839 0.09932759 -0.63474826 0.2306168 -0.61643584
I have used rnorm here; you can use any other distribution to generate the random numbers.
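For example, the same construction with the question's Time column name and a different distribution (a sketch; the var1-var5 names are illustrative):

df2 <- data.frame(ID = rep(ID, num), Time = sequence(num))
df2[paste0('var', 1:5)] <- replicate(5, runif(sum(num)))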
I want to check if there is any row in one dataframe ("apx") where an entry from the "AD" column in apx matches an entry in the "AD" column in another dataframe ("npx"), AND, where the SD entry from the matched row is within 13 units of the other.
I've checked several different references on SO, but couldn't find an answer due to my need to build a third dataframe (and other reasons).
My working trial is this...
npx <- data.frame(TN = c(111, "Z2", 4, "fox", 34256, 4782, "ZGJU45"),
                  SD = c(100, 200, 100, 600, 500, 115, 455),
                  AD = c("34YY", "37PD", "123M", "235W", "37PD", "123M", "1WW"))

apx <- data.frame(TN = c(222, "X34", 5, "bear", 47789, 37281, "VF456"),
                  SD = c(101, 201, 310, 450, 515, 660, 505),
                  AD = c("123M", "23XY", "5S S", "1WW", "27 30R", "14M", "37PD"))
Note: The AD entries "123M", "1WW", and "37PD" appear both in apx and in npx. The first and third of these appear twice in npx.
Ensure factors are changed to characters:
i <- sapply(apx, is.factor)
apx[i] <- lapply(apx[i], as.character)
i <- sapply(npx, is.factor)
npx[i] <- lapply(npx[i], as.character)
My fifth try...(forcing SD entries to integers)...
test5 <- apx[which(apx$AD == npx$AD &
                   as.integer(npx$SD) - as.integer(apx$SD) < 13)
             %in% as.integer(npx$SD), ]
One of my earlier tries....
test3 <- apx[which(apx$AD == npx$AD &
                   as.integer(npx$SD) - as.integer(apx$SD) < 13)
             %in% setequal(npx$SD, apx$SD), ]
What I am looking for in a third dataframe is....
TN SD AD
[1] 222 101 123M
because 123M (first row of apx) is found in the third row of npx and the corresponding entries for SD are within 13 units of each other (100 and 101). However, at the second occurrence of 123M in npx (in row six), the corresponding SD entries are 14 units apart (101 and 115). Actually, I'm looking for only those instances where the SD entry in npx is less than 13 units greater than the corresponding SD entry in apx.
[2] bear 450 1WW
because 1WW (4th row of apx) is found in the last row of npx and the corresponding entries for SD are within 13 units of each other (450 and 455).
[3] VF456 505 37PD
While 37PD (last row of apx) is found in the second row of npx, that entry doesn't qualify 37PD, since the corresponding SD values are in excess of 13 units apart (200 and 505). However, the corresponding SD entries for the other appearance of 37PD in npx (row five) are within 13 units of each other (500 and 505), thereby qualifying 37PD to appear in the resulting dataframe.
I'm gritting my teeth expecting someone to show me a very simple way to do this, but I'd rather suffer that embarrassment than spin my wheels any further. Thanks in advance.
If I understand what you're trying to do, I think we can use the merge and subset functions:
merge_df <- merge(npx, apx, by = 'AD', suffixes = c('npx','apx'))
subset(merge_df, SDnpx - SDapx <= 13 & SDnpx >= SDapx)
AD TNnpx SDnpx TNapx SDapx
3 1WW ZGJU45 455 bear 450
But I'll admit that I don't actually quite understand just what condition you're trying to enforce. If we're interested in rows which have an SD difference <= 13, then we can do the following:
subset(merge_df, abs(SDnpx - SDapx) <= 13)
AD TNnpx SDnpx TNapx SDapx
1 123M 4 100 222 101
3 1WW ZGJU45 455 bear 450
5 37PD 34256 500 VF456 505
Then getting the data into your final desired form (which isn't quite clear either) is just renaming and/or dropping columns from the data.frame.
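That last step might look like this (a sketch, starting from the "within 13" subset above):

result <- subset(merge_df, abs(SDnpx - SDapx) <= 13,
                 select = c(TNapx, SDapx, AD))
names(result) <- c("TN", "SD", "AD")
result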
I have a set of data in a csv file that I need to group based on transitions of one column. I'm new to R and I'm having trouble finding the right way to accomplish this.
Simplified version of data:
Time  Phase  Pressure  Speed
1     0      0.015     0
2     25     0.015     0
3     25     0.234     0
4     25     0.111     0
5     0      0.567     0
6     0      0.876     0
7     75     0.234     0
8     75     0.542     0
9     75     0.543     0
Each phase state lasts longer than shown above, but I shortened everything to keep it readable; this pattern continues on and on. What I'm trying to do is calculate the mean of pressure and speed for each instance where the phase is non-zero. For example, the output from the sample above would have two lines: one with the average of the three lines where phase is 25, and one with the average of the three lines where phase is 75. The same numeric value of phase can show up more than once, and I need to treat each of those occurrences separately. That is, in the case where phase is 0, 0, 25, 25, 25, 0, 0, 0, 25, 25, 0, I would need to record the first group and the second group of 25s as separate events, as well as any other non-zero groups.
What i've tried:
csv <- read.csv("c:\\test.csv")
ins <- subset(csv, csv$Phase == 25)
exs <- subset(csv, csv$Phase == 75)
mean(ins$Pressure)
mean(exs$Pressure)
This obviously returns the average of the entire file when phase is 25 and 75, but I need to somehow split it into groups using the trailing and leading 0s. Any help is appreciated.
Super quick:
df <- read.csv("your_file_name.csv")
cbind(aggregate(Pressure ~ Phase, df[df$Phase != 0, ], FUN = mean),
      aggregate(Speed ~ Phase, df[df$Phase != 0, ], FUN = mean)[2])
The cbind is a bit of a shortcut - depending on the distribution of values of Phase, you may need to merge instead.
EDITED: Based on feedback from the asker, they really want to aggregate across runs of numbers (i.e. the first group of continuous 25s, then the second group of continuous 25s, and so on). Because of that, I suggest using rle, the run-length encoding function, to get a group number that you can use in the aggregate command.
I've modified the original data so that it contains two runs of 25, just for illustrative purposes, but it should work regardless. Using rle we get the encoded runs of data, and then we create a group number for each row: we number the runs and use the rep function to repeat each run's number by that run's length.
After this is done, we can use the same basic aggregation command again.
df_example <- data.frame(Time = 1:9,
                         Phase = c(0, 25, 25, 25, 0, 0, 25, 25, 0),
                         Pressure = c(0.015, 0.015, 0.234, 0.111, 0.567, 0.876, 0.234, 0.542, 0.543),
                         Speed = rep(x = 0, times = 9))

encoded_runs <- rle(x = df_example$Phase)
df_example$Group_No <- rep(x = seq_along(encoded_runs$lengths),
                           times = encoded_runs$lengths)

aggregate(x = df_example[df_example$Phase != 0, c("Pressure", "Speed")],
          by = list(Group_No = df_example[df_example$Phase != 0, "Group_No"],
                    Phase = df_example[df_example$Phase != 0, "Phase"]),
          FUN = mean)
Group_No Phase Pressure Speed
1 2 25 0.120 0
2 4 25 0.388 0
Building upon comment by Solos, and answer by Cheesman,
try:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))
df_example = csv

aggregate(x = df_example[df_example$Phase != 0, c("Pressure", "Speed")],
          by = list(Phase = df_example[df_example$Phase != 0, "block"]),
          FUN = mean)
actually plyr would be handy:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))

require(plyr)
ddply(csv[csv$Phase != 0, ], .(block), summarize,
      mean.Pressure = mean(Pressure), mean.Speed = mean(Speed))
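If you prefer dplyr over plyr, an equivalent sketch:

library(dplyr)
csv %>%
  mutate(block = paste(Phase, cumsum(c(1, diff(Phase) != 0)))) %>%
  filter(Phase != 0) %>%
  group_by(block) %>%
  summarise(mean.Pressure = mean(Pressure), mean.Speed = mean(Speed))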