I have recently transitioned from Stata + Excel to R, so I would appreciate it if someone could help me write efficient code. I have tried my best to research the answer before posting on SO.
Here's what my data looks like:
mydata <- data.frame(sassign$buyer, sassign$purch, sassign$total_)
str(mydata)
'data.frame': 50000 obs. of 3 variables:
$ sassign.buyer : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 2 1 ...
$ sassign.purch : num 10 3 2 1 1 1 1 11 11 1 ...
$ sassign.total_: num 357 138 172 272 149 113 15 238 418 123 ...
head(mydata)
sassign.buyer sassign.purch sassign.total_
1 no 10 357
2 no 3 138
3 no 2 172
4 no 1 272
5 no 1 149
6 yes 1 113
My objective is to find the average number of buyers, i.e. the proportion of buyer == "yes", among customers with more than one purchase.
So, here's what I did:
Method 1: Long method
library(psych)
check <- as.numeric(mydata$sassign.buyer) - 1
myd <- cbind(mydata, check)
abcd <- psych::describe(myd[myd$sassign.purch > 1, ])
abcd$mean[4]
The output I got is 0.1031536697, which is correct.
@Sathish: Here's what check looks like:
head(check)
[1] 0 0 0 0 0 1
This did solve my purpose.
Pros of this method: it's easy and beginner-friendly.
Cons: too many. I need an extra variable (check), and the whole approach is clunky.
Side question: I realized that, by default, functions don't print at higher precision even though options(digits = 10) is set. For instance, here's what I got from running:
psych::describe(myd[myd$sassign.purch > 1, ])
vars n mean sd median trimmed mad min max range skew
sassign.buyer* 1 34880 1.10 0.30 1 1.00 0.00 1 2 1 2.61
sassign.purch 2 34880 5.14 3.48 4 4.73 2.97 2 12 10 0.65
sassign.total_ 3 34880 227.40 101.12 228 226.13 112.68 30 479 449 0.09
check 4 34880 0.10 0.30 0 0.00 0.00 0 1 1 2.61
kurtosis se
sassign.buyer* 4.81 0.00
sassign.purch -1.05 0.02
sassign.total_ -0.72 0.54
check 4.81 0.00
It was only when I ran
abcd$mean[4]
that I got 0.1031536697.
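(The rounding appears to come from psych's own print method, whose digits argument defaults to 2, rather than from options(digits); printing the describe object explicitly shows more precision:)

print(abcd, digits = 10)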
Method 2: Using dplyr
I tried pipes and nested function calls, but finally gave up.
Method 2 | Try 1:
psych::describe(dplyr::filter(mydata,mydata$sassign.purch>1)[,dplyr::mutate(as.numeric(mydata$sassign.buyer)-1)])
Output:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "c('double', 'numeric')"
Method 2 | Try 2: Using pipes:
mydata %>%
  mutate(newcol = as.numeric(sassign.buyer) - 1) %>%
  dplyr::filter(sassign.purch > 1) %>%
  summarise(meanpurch = mean(newcol))
This did work, and I got meanpurch = 0.1031537. However, I am still not sure about Try 1.
Any thoughts on why it isn't working?
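The error in Try 1 comes from mutate() expecting a data frame as its first argument, not the bare numeric vector as.numeric(mydata$sassign.buyer) - 1. A corrected sketch of what Try 1 was presumably aiming for, recreating Method 1's check column inside nested calls:

library(dplyr)

# filter first, then add the 0/1 indicator, then describe the result
abcd <- psych::describe(
  mutate(
    filter(mydata, sassign.purch > 1),
    check = as.numeric(sassign.buyer) - 1
  )
)
abcd["check", "mean"]  # 0.1031537, as in Method 1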
Data:
> dt
# sassign.buyer sassign.purch sassign.total_
# 1 no 10 357
# 2 no 3 138
# 3 no 2 172
# 4 no 1 272
# 5 no 1 149
# 6 yes 1 113
Number of Buyers with purchases greater than 1
library(dplyr)
dt %>%
group_by(sassign.buyer) %>%
filter(sassign.purch > 1)
#
# Source: local data frame [3 x 3]
# Groups: sassign.buyer [1]
#
# sassign.buyer sassign.purch sassign.total_
# (chr) (int) (int)
# 1 no 10 357
# 2 no 3 138
# 3 no 2 172
Average number of buyers with purchases greater than 1
dt %>%
group_by(sassign.buyer) %>%
filter(sassign.purch > 1) %>%
summarise(avg_no_buyers_gt_1 = length(sassign.buyer)/ nrow(dt))
# Source: local data frame [1 x 2]
#
# sassign.buyer avg_no_buyers_gt_1
# (chr) (dbl)
# 1 no 0.5
If no grouping of buyers is required,
dt %>%
filter(sassign.purch > 1) %>%
summarise(avg_no_buyers_gt_1 = length(sassign.buyer)/ nrow(dt))
# avg_no_buyers_gt_1
# 1 0.7777778
Finding the proportion of cases that meet a condition is easy to do with mean(). Here's a blog post explaining it: https://drsimonj.svbtle.com/proportionsfrequencies-with-mean-and-booleans, and here's a simple example:
buyer <- c("yes", "yes", "no", "no")
mean(buyer == "yes")
#> [1] 0.5
So in your case, you can do mean(d$sassign.buyer[d$sassign.purch > 1] == "yes"). Here's a worked example:
d <- data.frame(
sassign.buyer = factor(c("yes", "yes", "no", "no")),
sassign.purch = c(1, 10, 0, 200)
)
mean(d$sassign.buyer[d$sassign.purch > 1] == "yes")
#> [1] 0.5
This gets all cases where d$sassign.purch is greater than 1, and then computes the proportion (using mean()) of these cases in which d$sassign.buyer is equal to "yes".
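If you prefer to stay in dplyr, the same proportion can be computed in a short pipeline; a sketch using the toy d above, which also returns 0.5:

library(dplyr)

d %>%
  filter(sassign.purch > 1) %>%
  summarise(prop_yes = mean(sassign.buyer == "yes"))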
I'm fairly new to R.
I have a database (panel) and I want to delete some observations based on certain values.
Let's take the following panel as an example (derived from the plm package):
library(foreign)
Panel <- read.dta("http://dss.princeton.edu/training/Panel101.dta")
> head(Panel)
country year y y_bin x1 x2 x3 opinion op
1 A 1990 1342787840 1 0.2779036 -1.1079559 0.28255358 Str agree 1
2 A 1991 -1899660544 0 0.3206847 -0.9487200 0.49253848 Disag 0
3 A 1992 -11234363 0 0.3634657 -0.7894840 0.70252335 Disag 0
4 A 1993 2645775360 1 0.2461440 -0.8855330 -0.09439092 Disag 0
5 A 1994 3008334848 1 0.4246230 -0.7297683 0.94613063 Disag 0
6 A 1995 3229574144 1 0.4772141 -0.7232460 1.02968037 Str agree 1
I want to delete a country's observations for all years after the one in which OP = 1.
For instance, if OP = 1 in 1990, I want to exclude that country in 1991, 1992, 1993, etc. (all subsequent years in the database). If OP = 1 in 1996, I want to exclude that country in 1997, 1998 and 1999.
PS: The data frame may not be a good example, but in my data frame OP = 1 occurs only once.
Does anyone know how I can do that?
Thanks in advance.
EDIT: I forgot to say that I also want to keep observations that have OP = 0 for every year. I'm running a logit model, so I'm comparing OP = 1 and OP = 0.
I am assuming you want to remove, for each country separately, all the rows after the first 1 in op.
Using dplyr with filter:
library(dplyr)
Panel <- foreign::read.dta("http://dss.princeton.edu/training/Panel101.dta")
Panel %>%
group_by(country) %>%
filter(row_number() <= match(1, op)) %>%
ungroup
# country year y y_bin x1 x2 x3 opinion op
# <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
# 1 A 1990 1342787840 1 0.278 -1.11 0.283 Str agree 1
# 2 B 1990 -5934699520 0 -0.0818 1.43 0.0234 Agree 1
# 3 C 1990 -1292379264 0 1.31 -1.29 0.204 Agree 1
# 4 D 1990 1883025152 1 -0.314 1.74 0.647 Disag 0
# 5 D 1991 6037768704 1 0.360 2.13 1.10 Disag 0
# 6 D 1992 10244189 1 0.0519 1.68 0.970 Str agree 1
# 7 E 1990 1342787840 1 0.453 1.73 0.597 Str disag 0
# 8 E 1991 2296009472 1 0.419 1.71 0.793 Str agree 1
# 9 F 1990 1342787840 1 -0.568 -0.347 1.26 Str agree 1
#10 G 1990 1342787840 1 0.945 -1.52 1.45 Str disag 0
#11 G 1991 -1518985728 0 1.10 -1.46 1.44 Agree 1
Or the same thing with slice:
Panel %>%
group_by(country) %>%
slice(seq_len(match(1, op))) %>%
ungroup
Your answers were great, but I forgot to make one thing precise in the question: they let me keep observations up to op = 1, but I also want to keep the countries that have op = 0 for every year. I'm running a logit model; those with op = 0 will be the non-adopters, for instance, and those with op = 1 the adopters.
Sorry in advance if the post isn't clear.
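A sketch that also keeps the all-zero countries (hedged: match(1, op) returns NA for a country whose op is never 1, which makes the filter/slice approaches above drop or fail on those groups; coalesce() substitutes the group size in that case):

library(dplyr)

Panel %>%
  group_by(country) %>%
  # keep rows up to the first op == 1; keep every row if there is none
  filter(row_number() <= coalesce(match(1, op), n())) %>%
  ungroup()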
So I have my data frame: 74 observations and 43 columns. I performed cluster analysis on them.
I then got 5 clusters, and assigned the cluster number to each respective row.
Now my df has 74 rows (obs) and 44 variables, and I would like to plot, for every variable, which variables are enriched in each cluster and which are not.
I want to achieve this with ggplot.
My imagined output panel has 5 boxplots per row and 42 rows of plots; each row describes one variable measured in the dataset.
Example of the dataset (sorry, it's very big, so I made an example; actual values are different):
df
ID EGF FGF_2 Eotaxin TGF G_CSF Flt3L GMSF Frac IFNa2 .... Cluster
4300 4.21 139.32 3.10 0 1.81 3.48 1.86 9.51 9.41 .... 1
2345 7.19 233.10 0 1.81 3.48 1.86 9.41 0 11.4 .... 1
4300 4.21 139.32 4.59 0 1.81 3.48 1.86 9.51 9.41 .... 1
....
3457 0.19 233.10 0 1.99 3.48 1.86 9.41 0 20.4 .... 3
5420 4.21 139.32 3.10 0.56 1.81 3.48 1.86 9.51 29.8 .... 1
2334 7.19 233.10 2.68 2.22 3.48 1.86 9.41 0 28.8 .... 5
str(df)
$ ID : Factor w/ 45 levels "4300"..... : 44 8 24 ....
$ EGF : num ....
$ FGF_2 : num ....
$ Eotaxin : num ....
....
$ Cluster : Factor w/ 5 levels "1" , "2"...: 1 1 1.....3 1 5
# now plotting
# I thought I'd pivot the data frame first
new_df <- pivot_longer(df[,2:44],df$cluster, names_to = "Cytokine measured", values_to = "count")
#ggplot
ggplot(new_df,aes(x = new_df$cluster, y = new_df$count))+
geom_boxplot(width=0.2,alpha=0.1)+
geom_jitter(width=0.15)+
facet_grid(new_df$`Cytokine measured`~new_df$cluster, scales = 'free')
The code did generate a small panel of graphs that fits my imagined output, but I can see only 5 rows instead of 42.
So going back to new_df, the last 3 columns draw my attention:
Cluster Cytokine measured count
1 EGF 2.66
1 FGF_2 390.1
1 Eotaxin 6.75
1 TGF 0
1 G_CSF 520
3 EGF 45
5 FGF_2 4
4 Eotaxin 0
1 TGF 0
1 G_CSF 43
....
So it seems the Cluster and count columns are correct, whereas Cytokine measured just keeps repeating the same 5 variable names instead of covering all 42 variables I want to see.
I think the table conversion step is wrong, but I don't quite know what went wrong or how to fix it.
Please enlighten me.
We can try this; I'll simulate something that looks like your data frame:
df <- data.frame(
  ID = 1:74,
  matrix(rnorm(74 * 43), ncol = 43)
)
colnames(df)[-1] <- paste0("Measurement", 1:43)
df$cluster <- cutree(hclust(dist(scale(df[, -1]))), 5)
df$cluster <- factor(df$cluster)
Then melt:
library(ggplot2)
library(tidyr)
library(dplyr)
melted_df <- df %>% pivot_longer(-c(cluster, ID), values_to = "count")
g <- ggplot(melted_df, aes(x = cluster, y = count, col = cluster)) +
  geom_boxplot() +
  facet_wrap(~name, ncol = 5, scales = "free_y")
You can save it as a bigger plot to look at:
ggsave("plot.pdf", g, width = 15, height = 15)
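Translated back to the original df, the reshaping bug in the question is that pivot_longer()'s cols argument was given df$cluster rather than a selection of measurement columns; selecting everything except the identifier columns fixes it. A sketch, assuming the column names ID and Cluster from the question's str() output:

library(tidyr)
library(ggplot2)

# pivot all measurement columns, keeping ID and Cluster as identifiers
new_df <- pivot_longer(df, -c(ID, Cluster),
                       names_to = "Cytokine measured", values_to = "count")
ggplot(new_df, aes(x = Cluster, y = count)) +
  geom_boxplot(width = 0.2, alpha = 0.1) +
  geom_jitter(width = 0.15) +
  facet_wrap(~`Cytokine measured`, ncol = 5, scales = "free")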
How can I apply a package function to a data frame?
I have a data set (df) with two columns (total and n) to which I would like to apply the pois.exact function (pois.exact(x, pt = 1, conf.level = 0.95)) from the epitools package, with x = df$n and pt = df$total, and get a "new" data frame (new_df) with 3 more columns holding the corresponding rounded rates and lower and upper CIs.
df <- data.frame("total" = c(35725302,35627717,34565295,36170648,38957933,36579643,29628394,18212075,39562754,1265055), "n" = c(24,66,166,461,898,1416,1781,1284,329,12))
> df
total n
1 35725302 24
2 35627717 66
3 34565295 166
4 36170648 461
5 38957933 898
6 36579643 1416
7 29628394 1781
8 18212075 1284
9 9562754 329
In fact, the actual data frame is much longer.
For example, for the first row the desired results are:
require(epitools)
round(pois.exact(24, pt = 35725302, conf.level = 0.95) * 100000, 2)[3:5]
rate lower upper
1 0.07 0.04 0.1
The new data frame, with the results of applying pois.exact added, should look like this:
> new_df
total n incidence lower_95IC upper_95IC
1 35725302 24 0.07 0.04 0.10
2 35627717 66 0.19 0.14 0.24
3 34565295 166 0.48 0.41 0.56
4 36170648 461 1.27 1.16 1.40
5 38957933 898 2.31 2.16 2.46
6 36579643 1416 3.87 3.67 4.08
7 29628394 1781 6.01 5.74 6.03
8 18212075 1284 7.05 6.67 7.45
9 9562754 329 3.44 3.08 3.83
Thanks.
library(epitools)
library(dplyr)

df %>%
  cbind(pois.exact(df$total, df$n)) %>%
  dplyr::select(total, n, rate, lower, upper)
# total n rate lower upper
# 1 35725302 24 1488554.25 1488066.17 1489042.45
# 2 35627717 66 539813.89 539636.65 539991.18
# 3 34565295 166 208224.67 208155.26 208294.10
# 4 36170648 461 78461.28 78435.71 78486.85
# 5 38957933 898 43383.00 43369.38 43396.62
# 6 36579643 1416 25833.08 25824.71 25841.45
# 7 29628394 1781 16635.82 16629.83 16641.81
# 8 18212075 1284 14183.86 14177.35 14190.37
# 9 39562754 329 120251.53 120214.06 120289.01
# 10 1265055 12 105421.25 105237.62 105605.12
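Note that the rate above is total/n rather than the per-100,000 incidence the question asks for. To reproduce the desired new_df, a sketch following the question's own single-row example (columns 3:5 of pois.exact's result are rate, lower and upper):

library(epitools)

# exact Poisson rates per 100,000 person-time units, rounded to 2 digits
rates <- round(pois.exact(df$n, pt = df$total, conf.level = 0.95)[, 3:5] * 100000, 2)
names(rates) <- c("incidence", "lower_95IC", "upper_95IC")
new_df <- cbind(df, rates)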
Given a data frame
df <- data.frame("id"=c(1,2,3), "a"=c(10.0, 11.2, 12.3),"b"=c(10.1, 11.9, 12.9))
> df
id a b
1 1 10.0 10.1
2 2 11.2 11.9
3 3 12.3 12.9
> str(df)
'data.frame': 3 obs. of 3 variables:
$ id: num 1 2 3
$ a : num 10 11.2 12.3
$ b : num 10.1 11.9 12.9
Question
When subsetting the first row, the .0 decimal part of the 10.0 in column a gets dropped:
> df[1,]
id a b
1 1 10 10.1
> str(df[1,])
'data.frame': 1 obs. of 3 variables:
$ id: num 1
$ a : num 10
$ b : num 10.1
I 'assume' this is intentional, but how do I subset the first row so that it keeps the .0 part?
Notes
Subsetting two rows keeps the .0
> df[1:2,]
id a b
1 1 10.0 10.1
2 2 11.2 11.9
I assume you understand this is a matter of how the number is printed, not of how the value is stored by R. Anyway, you can use format to ensure the digits will be printed:
> format(df[1,], nsmall = 1)
id a b
1 1.0 10.0 10.1
> format(df[1,], nsmall = 2)
id a b
1 1.00 10.00 10.10
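Keep in mind that format() returns character columns, so it is a tool for display, not for further computation:

> str(format(df[1,], nsmall = 1))
'data.frame': 1 obs. of 3 variables:
 $ id: chr "1.0"
 $ a : chr "10.0"
 $ b : chr "10.1"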
The reason for this behavior is not the number of rows being printed: R tries to display the minimum number of decimals possible, but all numbers in a column get the same number of digits so the display stays aligned:
> df2 <- data.frame(a=c(1.00001, 1), b=1:2)
> df2
a b
1 1.00001 1
2 1.00000 2
Now if I print only the row with the non-integer number:
> df2[1,]
a b
1 1.00001 1
I am looking for a convenient way to convert positive values (proportions) into negative values of the same variable, depending on the value of another variable.
This is what the data structure looks like:
id Item Var1 Freq
1 P1 0 0.043
2 P2 1 0.078
3 P3 2 0.454
4 P4 3 0.543
5 T1 0 0.001
6 T2 1 0
7 T3 2 0.045
8 T4 3 0.321
9 A1 0 0.671
...
More precisely, I would like to make the Freq values negative if Var1 <= 1 (e.g. -0.043).
This is what I tried:
for(i in 1: 180) {
if (mydata$Var1 <= "1") (mydata$Freq*(-1))}
OR
mydata$Freq[mydata$Var1 <= "1"] = -abs(mydata$Freq)}
In both cases the negative sign is set where it should be, but the numbers themselves get altered as well.
Any help is highly appreciated. THANKS!
new.Freq <- with(mydata, ifelse(Var1 <= 1, -Freq, Freq))
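(To keep the result, assign it back, e.g. mydata$Freq <- new.Freq, or store it as a new column with mydata$new.Freq <- new.Freq.)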
Try:
index <- mydata$Var1 <= 1
mydata$Freq[index] = -abs(mydata$Freq[index])
There are two errors in your attempted code:
You did a character comparison by writing x <= "1"; this should be a numeric comparison, i.e. x <= 1
Although you are replacing a subset of your vector, you don't refer to the same subset on the right-hand side of the replacement
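For completeness, the same fix in dplyr style; a sketch assuming mydata as shown in the question:

library(dplyr)

mydata <- mydata %>%
  mutate(Freq = ifelse(Var1 <= 1, -Freq, Freq))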
ifelse() can also be used to combine two variables when one of them holds negative values and you want to retain those negatives; similarly, you can convert a variable to negative values by putting - in front of it (as mentioned above), e.g. -Freq.
mydata$new_Freq <- with(mydata, ifelse(Low_Freq < 0, Low_Freq, Freq))
id Item Var1 Freq Low_Freq
1 P1 0 1.043 -0.063
2 P2 1 1.078 -0.077
3 P3 2 2.401 -0.068
4 P4 3 3.543 -0.323
5 T1 0 1.001 1.333
6 T2 1 1.778 1.887
7 T3 2 2.045 1.011
8 T4 3 3.321 1.000
9 A1 0 4.671 2.303
# Output would be:
id Item Var1 Freq Low_Freq new_Freq
1 P1 0 1.043 -0.063 -0.063
2 P2 1 1.078 -0.077 -0.077
3 P3 2 2.401 -0.068 -0.068
4 P4 3 3.543 -0.323 -0.323
5 T1 0 1.001 1.333 1.001
6 T2 1 1.778 1.887 1.778
7 T3 2 2.045 1.011 2.045
8 T4 3 3.321 1.000 3.321
9 A1 0 4.671 2.303 4.671