How to set up two dynamic conditions in a SUMIFS-like problem in R?

I have already tried my best but am still pretty much a newbie to R.
Based on roughly 500 MB of input data that currently looks like this:
TOTALLISTINGS
  listing_id calc.latitude calc.longitude reviews_last30days
1       2818       5829821       335511.0                  1
2      20168       5829746       335265.2                  3
3      25428       5830640       331534.6                  0
4      27886       5832156       332003.1                  3
5      28658       5830888       329727.2                  3
6      28871       5829980       332071.3                  7
I need to calculate a conditional sum of reviews_last30days, where the conditions are a specific, changing area range for each record: R should sum only those reviews whose calc.latitude and calc.longitude do not deviate by more than +/-500 from the latitude and longitude values in the current row.
EXAMPLE:
ROW 1 has calc.latitude 5829821 and calc.longitude 335511.0, so R should take the sum of all reviews_last30days for which the following ranges apply: calc.latitude 5829321 to 5830321 (value of row 1 latitude +/-500)
calc.longitude 335011.0 to 336011.0 (value of row 1 longitude +/-500)
So my intended output would look somewhat like this in column 5:
TOTALLISTINGS
  listing_id calc.latitude calc.longitude reviews_last30days reviewsper1000
1       2818       5829821       335511.0                  1              4
2      20168       5829746       335265.2                  3              4
3      25428       5830640       331534.6                  0             10
4      27886       5832156       332003.1                  3              3
5      28658       5830888       331727.2                  3             10
6      28871       5829980       332071.3                  7             10
I hope I calculated that correctly in my head, but you get the idea.
So far I particularly struggle with the fact that my sum conditions are dynamic and have to be "newly assigned" for every row, since the latitude and longitude ranges change with each record.
My current code looks like this, but it obviously doesn't work that way:
review1000 <- function(TOTALLISTINGS = NULL) {
  # tibble to return
  to_return <- TOTALLISTINGS %>%
    group_by(listing_id) %>%
    summarise(
      reviews1000 = sum(reviews_last30days[(calc.latitude >= (calc.latitude - 500) | calc.latitude <= (calc.latitude + 500))]))
  return(to_return)
}
REVIEWPERAREA <- review1000(TOTALLISTINGS)
I know I would also have to add something for longitude in the code above.
Does anyone have an idea how to fix this?
Any help or hints are highly appreciated. Thanks in advance! :)

See whether the code below helps.
library(dplyr)  # for between()
# For each row, sum reviews_last30days over all listings within +/-500 of that row's latitude and longitude
TOTALLISTINGS$reviews1000 <- sapply(1:nrow(TOTALLISTINGS), function(r) {
  currentLATI <- TOTALLISTINGS$calc.latitude[r]
  currentLONG <- TOTALLISTINGS$calc.longitude[r]
  sum(TOTALLISTINGS$reviews_last30days[
    between(TOTALLISTINGS$calc.latitude, currentLATI - 500, currentLATI + 500) &
      between(TOTALLISTINGS$calc.longitude, currentLONG - 500, currentLONG + 500)])
})
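A rowwise dplyr variant of the same idea is a possible alternative; this is a minimal sketch, assuming TOTALLISTINGS is a data frame with exactly the columns shown in the question:
library(dplyr)
# For each row, keep listings within +/-500 of that row's coordinates and sum their reviews
TOTALLISTINGS %>%
  rowwise() %>%
  mutate(reviews1000 = sum(TOTALLISTINGS$reviews_last30days[
    abs(TOTALLISTINGS$calc.latitude  - calc.latitude)  <= 500 &
      abs(TOTALLISTINGS$calc.longitude - calc.longitude) <= 500])) %>%
  ungroup()
Note that both this and the sapply() version scan the full table once per row, so on roughly 500 MB of data either may be slow; that trade-off is worth keeping in mind.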

Related

R: Running multiple tests by selecting (and increasing) number of fixed data points selected - Followup

This is a follow-up from a previous post (R: Running multiple tests by selecting (and increasing) number of fixed data points selected):
I have a dataframe (saved as data.csv) that looks something like this:
person outcome baseline_post time
     1       0      baseline BL_1
     1       1      baseline BL_2
     1       0      baseline BL_3
     1       2      baseline BL_4
     1       4          post post_1
     1       3          post post_2
     1       4          post post_3
     1       6          post post_4
     2       1      baseline BL_1
     2       2      baseline BL_2
     2       0      baseline BL_3
     2       1      baseline BL_4
     2       3          post post_1
     2       2          post post_2
     2       4          post post_3
     2       3          post post_4
As in the previous post, the purpose is to iterate the same test (it can be any test) over the desired fixed combinations arranged across time,
i.e., for each participant, compare outcome(s) at BL_1 against post_1, then BL_1 and BL_2 against post_1 ... then BL_1, BL_2, BL_3 and BL_4 against post_1, and so on.
Basically, all combinations increasing in the number of weeks tested before (BL_1 to 4) and after (post_1 to 4) treatment.
I tried modifying @Caspar V.'s code (thanks @Caspar V. for your previous response):
#creating pre/post data frames for later use
df <- read.csv("C:/Users/data.csv")
df_baseline <- filter(df, baseline_post == "baseline") %>%
  rename(baseline = baseline_post) %>%
  rename(time_baseline = time)
df_post <- filter(df, baseline_post == "post") %>%
  rename(post = baseline_post) %>%
  rename(time_post = time)

#generate a list of desired comparisons
comparisons = list()
for(a_len in seq_along(df_baseline$baseline)) for(b_len in seq_along(df_post$post)) {
  comp = list(baseline = head(df_baseline$time_baseline, a_len), post = head(df_post$time_post, b_len))
  comparisons = append(comparisons, list(comp))
}

#KIV create combined df for time if required
df_baseline_post <- cbind(df_baseline$time_baseline, df_post$time_post)
colnames(df_baseline_post) = c("time_baseline", "time_post")

#iterate through list of comparisons
for(df_baseline_post in comparisons) {
  cat(df_baseline_post$time_baseline, 'versus', df_baseline_post$time_post, '\n')
  #this is where your analysis goes, poisson_frequencies being a test function I created
  poisson_frequencies(df)
}
This is unfortunately my output, which is just a string of bare "versus-es" (there should be 16, because there are 16 possible combinations based on the above data):
versus
versus
versus
versus
versus
versus
...
versus
I am not sure what went wrong. I'd appreciate any input; I am new when it comes to programming in R.
There are a number of problems; the following should get you back on track. Good luck!
1)
You're getting 64 comparisons in comparisons, not 16; if you look at the contents of comparisons you'll see that. It's because you have duplicates in df$time (each time label appears once per person), so seq_along() runs over 8 values instead of 4. You'll need to remove the duplicates first:
#generate a list of desired comparisons
groupA = unique(df_baseline$time_baseline)
groupB = unique(df_post$time_post)
comparisons = list()
for(a_len in seq_along(groupA)) for(b_len in seq_along(groupB)) {
  comp = list(baseline = head(groupA, a_len), post = head(groupB, b_len))
  comparisons = append(comparisons, list(comp))
}
2)
The following block is not used, and the variable df_baseline_post is overwritten in the for-loop after it, so you can just remove this:
#KIV create combined df for time if required
# df_baseline_post <- cbind(df_baseline$time_baseline, df_post$time_post)
# colnames(df_baseline_post) = c("time_baseline", "time_post")
3)
You're executing poisson_frequencies(df) every time, but not doing anything with the output. That's why you're not seeing anything. You'll need to put a print() around it: print(poisson_frequencies(df)). Of course df is also not the data you want to work with, but I hope you already knew that.
4)
df_baseline_post$time_baseline and df_baseline_post$time_post don't exist. The loop should be:
for(df_baseline_post in comparisons) {
  cat(df_baseline_post$baseline, 'versus', df_baseline_post$post, '\n')
  print(poisson_frequencies(df))
}
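Building on point 3, here is a minimal sketch of how the loop might subset df to the timepoints of the current comparison before running the test; poisson_frequencies() is the asker's own function, so the exact input it expects is an assumption here:
for (comp in comparisons) {
  cat(comp$baseline, 'versus', comp$post, '\n')
  # keep only the rows whose time value belongs to the current comparison
  df_subset <- df[df$time %in% c(comp$baseline, comp$post), ]
  print(poisson_frequencies(df_subset))  # assumes the test accepts such a subset
}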

Find the GROWTH RATE of FaceValue for 5 days in percentage

I'm trying to add another column and find the growth rate of the FaceValue column per day, in percent.
Day FaceValue
1   ₦72,077,680.94
2   ₦112,763,770.99
3   ₦118,146,250.01
4   ₦74,446,035.80
5   ₦77,026,183.71
Here is the code, but it's not working:
value_performance %>%
  mutate(change = (value_performance$FaceValue - lag(FaceValue, 5)) / lag(FaceValue, 5) * 100)
Thanks
Three problems:
FaceValue appears to be a string, not numeric, try first fixing that with as.numeric;
(Almost) never use value_performance$ inside of a dplyr-pipe verb. ("Almost" because there are rare times when you need it. Otherwise you are at best being inefficient, possibly using incorrect values depending on what is happening in the pipe before its use.); and
You say "per day" but you are lagging by 5. While I'm assuming your real data has more than 5 rows, you are still not calculating by-day.
Try this.
value_performance %>%
  mutate(
    FaceValue = as.numeric(gsub("[^0-9.]", "", FaceValue)),
    change = (FaceValue - lag(FaceValue)) / lag(FaceValue)
  )
#   Day FaceValue  change
# 1   1  7.21e+07      NA
# 2   2  1.13e+08  0.5645
# 3   3  1.18e+08  0.0477
# 4   4  7.44e+07 -0.3699
# 5   5  7.70e+07  0.0347
With similar data:
Day <- c(1,2,3,4,5)
FaceValue <- c(72077680.94, 112763770.99, 118146250.01, 74446035.80, 77026183.71)
df <- data.frame(Day, FaceValue)
df
df %>%
  mutate(change = 100 * (FaceValue / lag(FaceValue) - 1))
Results in:
  Day FaceValue     change
1   1  72077681         NA
2   2 112763771  56.447557
3   3 118146250   4.773234
4   4  74446036 -36.988236
5   5  77026184   3.465796
Not sure what is wrong. Maybe check your data classes and make sure FaceValue is numerical.
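If FaceValue really is stored as text with the currency symbol and thousands separators, a hedged alternative to the gsub() call above is readr::parse_number(), which strips both in one step (this assumes the readr package is installed and value_performance matches the table in the question):
library(dplyr)
library(readr)
value_performance %>%
  mutate(FaceValue = parse_number(FaceValue),              # drops "₦" and the commas
         change = 100 * (FaceValue / lag(FaceValue) - 1))  # day-over-day change in percent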

How can I group by one variable in terms of status of a different variable in a longitudinal situation in R?

I'm new to R, so please go easy on me... I have some longitudinal data with ID, Location and Completion_status columns.
Basically, I'm trying to find a way to get a table with a) the number of unique cases that have all complete data and b) the number of unique cases that have at least one incomplete or missing value, broken down by Location. This is what I have tried so far:
df <- df %>% group_by(Location)
df1 <- df %>% group_by(any(Completion_status == 'Incomplete' | 'Missing'))
I'm not sure exactly what you want, because there seems to be some inconsistency between your request and the desired output; however, let's try. It seems you need a kind of frequency table, which you can manage with base R. At the bottom of the answer you can find some data similar to yours.
# You have two cases, Complete and the others, so here is a new column for that:
data$case <- ifelse(data$Completion_status == 'Complete', 'Complete', 'MorIn')
# now a frequency table about them; if you want a data.frame, here we go
result <- as.data.frame.matrix(table(data$Location, data$case))
# now the location as a new column rather than the rownames
result$Location <- rownames(result)
# and lastly a data.frame with the final results; note that you can change the names
# of the columns, but if you want spaces maybe a tibble is better
result <- data.frame(Location = result$Location,
                     `Number.complete` = result$Complete,
                     `Number.incomplete.missing` = result$MorIn)
result
     Location Number.complete Number.incomplete.missing
1      London               0                         1
2 Los Angeles               0                         1
3       Paris               3                         1
4     Phoenix               0                         2
5     Toronto               1                         1
Or if you prefer a dplyr chain:
data %>%
  mutate(case = ifelse(data$Completion_status == 'Complete', 'Complete', 'MorIn')) %>%
  do(as.data.frame.matrix(table(.$Location, .$case))) %>%
  mutate(Location = rownames(.)) %>%
  select(3, 1, 2) %>%
  `colnames<-`(c("Location", "Number of complete ", "Number of incomplete or"))
     Location Number of complete  Number of incomplete or
1      London                   0                       1
2 Los Angeles                   0                       1
3       Paris                   3                       1
4     Phoenix                   0                       2
5     Toronto                   1                       1
With data:
# here is your data (next time try to put it in a usable form in the question)
data <- data.frame(ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
                   Location = c('Paris','Paris','Paris','Paris','London','Toronto',
                                'Toronto','Phoenix','Phoenix','Los Angeles'),
                   Completion_status = c('Complete','Complete','Incomplete','Complete','Incomplete','Missing',
                                         'Complete','Incomplete','Incomplete','Missing'))
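For completeness, a minimal dplyr + tidyr sketch of the same frequency table, assuming tidyr >= 1.0 for pivot_wider() and using the data frame defined above:
library(dplyr)
library(tidyr)
data %>%
  mutate(case = ifelse(Completion_status == 'Complete',
                       'Number.complete', 'Number.incomplete.missing')) %>%
  count(Location, case) %>%
  pivot_wider(names_from = case, values_from = n, values_fill = 0)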

Performing a 2 sample t test in R with replicates

I have a dataframe named R_alltemp in R with 6 columns: 2 groups of data with 3 replicates each. I'm trying to perform a t-test for each row between the first three values and the last three, and to use apply() so it can go through all the rows in one line. Here is the code I'm using so far.
R_alltemp$p.value<-apply(R_all3,1, function (x) t.test(x(R_alltemp[,1:3]), x(R_alltemp[,4:6]))$p.value)
and here is a snapshot of the table
   R1.HCC827  R2.HCC827  R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975   p.value
1  13.587632  22.225083  15.074230    58.187465           79    82.287573 0.4391160
2   2.717526   1.778007   1.773439     1.763257            2     1.679338 0.4186339
3 203.814478 191.135711 232.320487   253.908939          263   263.656100 0.4904493
4  44.386264  45.339169  54.089884     3.526513            3     5.877684 0.3095634
It runs, but the p-values I'm getting seem wrong just from eyeballing them. For instance, in the first line the average of the first group is way lower than that of the second group, but my p-value is only 0.4.
I feel like I'm missing something very obvious here, but I've been struggling with it for much longer than I'd like. Any help would be appreciated.
Your code is incorrect. I actually don't understand why it does not return an error. This part in particular: x(R_alltemp[,1:3]) should be x[1:3].
This should be your code:
R_alltemp$p.value2 <- apply(R_alltemp, 1, function(x) t.test(x[1:3], x[4:6])$p.value)
   R1.HCC827  R2.HCC827  R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975   p.value    p.value2
1  13.587632  22.225083  15.074230    58.187465           79    82.287573 0.4391160 0.010595829
2   2.717526   1.778007   1.773439     1.763257            2     1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487   253.908939          263   263.656100 0.4904493 0.044883436
4  44.386264  45.339169  54.089884     3.526513            3     5.877684 0.3095634 0.002853154
Remember that by specifying MARGIN = 1 you are telling apply to work row by row, so function(x) receives the values of one row across all columns, i.e. the equivalent of x <- c(13.587632, 22.225083, 15.074230, 58.187465, 79, 82.287573). That means you want to subset the first three values with x[1:3] and the last three with x[4:6] and apply t.test to them.
A good idea before using apply is to test the function manually, so that if you do get odd results like these you know something went wrong with your code.
So the two-tailed p-value for the first row should be:
> g1 <- c(13.587632, 22.225083, 15.074230)
> g2 <- c(58.187465, 79, 82.287573)
> t.test(g1,g2)$p.value
[1] 0.01059583
Applying the function across all rows (I tacked the new p-value on at the end as pval):
> tt$pval <- apply(tt,1,function(x) t.test(x[1:3],x[4:6])$p.value)
> tt
   R1.HCC827  R2.HCC827  R3.HCC827 R1.nci.h1975 R2.nci.h1975 R3.nci.h1975   p.value        pval
1  13.587632  22.225083  15.074230    58.187465           79    82.287573 0.4391160 0.010595829
2   2.717526   1.778007   1.773439     1.763257            2     1.679338 0.4186339 0.477533387
3 203.814478 191.135711 232.320487   253.908939          263   263.656100 0.4904493 0.044883436
4  44.386264  45.339169  54.089884     3.526513            3     5.877684 0.3095634 0.002853154
Maybe it's the double-use of the data frame name in the function (that you don't need)?
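A small sketch that ties both answers together: wrap the per-row test in a named helper so you can check a single row by hand, then pass the same helper to apply(); this assumes the first six columns of R_alltemp are the two groups of replicates:
# Welch t-test between the first three and last three values of a row
row_pval <- function(x) t.test(x[1:3], x[4:6])$p.value
row_pval(unlist(R_alltemp[1, 1:6]))                        # sanity-check one row by hand
R_alltemp$p.value <- apply(R_alltemp[, 1:6], 1, row_pval)  # then run it over all rows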

R aggregate by a variable then find out proportion of a each column

Sorry, I've tried my best but I didn't find the answer. As a beginner, I'm not sure I'm able to put the question clearly. Thanks in advance.
So I have a dataframe of consumption data with 24,000 rows.
In this dataframe, there is a series of variables about the number of items bought within the last two months:
NumberOfCoat, NumberOfShirt, NumberOfPants, NumberOfShoes...
And there is a variable "profession", coded as a number.
So the data now looks like this:
          profession NumberOfCoat NumberOfShirt NumberOfShoes
individu1          1            1             1             1
individu2          3            2             4             1
individu3          2            2             0             0
individu4          6            0             3             2
individu5          5            0             2             3
individu6          7            1             0             5
individu7          4            3             1             2
I would like to know the structure of consumption by profession and get something like this:
            ProportionOfCoat ProportionOfShirt ProportionOfShoes ...
profession1              0.3               0.5               0.1
profession2              0.1               0.2               0.4
profession3              0.2               0.6               0.1
profession4              0.1               0.1               0.2
I don't know if this is clear, but in the end I want to be able to say:
"10% of the clothing items that doctors bought are T-shirts, whereas 20% of what teachers bought are T-shirts."
And finally, I'd like to draw a stacked barplot where each stack is scaled to sum to 100%.
I suppose that we can use dplyr?
Thank you very much!
temp <- aggregate(. ~ profession, data = zzz, FUN = sum)
cbind(temp[1], temp[-1] / rowSums(temp[-1]))
or also using prop.table (a sketch follows below)
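A minimal sketch of the prop.table() variant just mentioned, under the same assumption that zzz is your data frame with profession as the first column and the count columns after it:
temp <- aggregate(. ~ profession, data = zzz, FUN = sum)
# prop.table(x, margin = 1) divides each row by its row sum
cbind(temp[1], prop.table(as.matrix(temp[-1]), margin = 1))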
As other people noted, it is always better to post a reproducible example. I'll try to post one with my solution, which is longer than the ones already posted but, for the same reason, maybe clearer.
First you should create an example dataframe:
set.seed(10) # I set a seed because I'll use the sample() function
n <- 1:100   # vector from 1 to 100 for the number of products bought
p <- 1:8     # vector for the profession ids
profession <- sample(p, 50, replace = TRUE)
NumberOfCoat <- sample(n, 50, replace = TRUE)
NumberOfShirt <- sample(n, 50, replace = TRUE)
NumberOfShoes <- sample(n, 50, replace = TRUE)
df <- as.data.frame(cbind(profession, NumberOfCoat,
                          NumberOfShirt, NumberOfShoes))
Once you have the dataframe, you can explain what you have tried so far, or a possible solution. Here I used dplyr.
df <- df %>%
  group_by(profession) %>%
  summarize(coats = sum(NumberOfCoat),
            shirts = sum(NumberOfShirt),
            shoes = sum(NumberOfShoes)) %>%
  mutate(tot_prod = coats + shirts + shoes,
         ProportionOfCoat = coats / tot_prod,
         ProportionOfShirt = shirts / tot_prod,
         ProportionofShoes = shoes / tot_prod) %>%
  select(profession, ProportionOfCoat, ProportionOfShirt, ProportionofShoes)
df corresponds to the second dataframe you show, where you have the proportion of each product bought by each profession. In my example it looks like this:
  profession ProportionOfCoat ProportionOfShirt ProportionofShoes
       <int>            <dbl>             <dbl>             <dbl>
1          1        0.3910483         0.2343934         0.3745583
2          2        0.4069641         0.3525571         0.2404788
3          3        0.3330804         0.3968134         0.2701062
4          4        0.2740657         0.3952435         0.3306908
5          5        0.2573991         0.3784753         0.3641256
6          6        0.2293814         0.3543814         0.4162371
7          7        0.2245841         0.3955638         0.3798521
8          8        0.2861635         0.3490566         0.3647799
If you want to produce a stacked barplot, you have to reshape your data to long format in order to use ggplot2. As @alistaire noted, you can do it with the gather function from the tidyr package.
df <- df %>% gather(product, proportion, -profession)
And finally you can plot it with ggplot2.
ggplot(df, aes(x = profession, y = proportion, fill = product)) +
  geom_bar(stat = "identity")
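A hedged side note: if you ever skip the manual proportion step and keep raw counts in long format, position = "fill" makes ggplot2 rescale each stack to 100% for you. In this sketch df_counts is a hypothetical long-format data frame with columns profession, product and count:
library(ggplot2)
# df_counts is assumed, not defined above: raw counts in long format
ggplot(df_counts, aes(x = factor(profession), y = count, fill = product)) +
  geom_bar(stat = "identity", position = "fill") +  # each stack rescaled to sum to 1
  scale_y_continuous(labels = scales::percent)      # shown as percentages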
