Create a histogram with a non-numeric variable in R [duplicate] - r

Imagine you have a data frame with 2 variables - Name & Age. Name is of class factor and Age number. Now imagine now there are thousands of people in this data frame. How do you:
Produce a table with: NAME | COUNT(NAME) for each name uniquely?
Produce a histogram where you can change the minimum number of
occurrences to show up in the histogram.?
For part 2, I want to be able to test different minimum frequency values and see how the histogram comes out. Or is there a better method pragmatically to determine the minimum count for each name to enter the histogram?
Thanks!
Edit: Here is what the table would look like in a RDBS:
NAME | COUNT(NAME)
John | 10
Bill | 24
Jane | 12
Tony | 50
Emanuel| 1
...
What I want to be able to do is create a function to graph a histogram, where I can change a value that sets the minimum frequency to be graphed. Make more sense?

> x <- read.table(textConnection('
+ Name Age Gender Presents Behaviour
+ 1 John 9 male 25 naughty
+ 2 Bill 5 male 20 nice
+ 3 Jane 4 female 30 nice
+ 4 Jane 4 female 20 naughty
+ 5 Tony 4 male 34 naughty'
+ ), header=TRUE)
>
> table(x$Name)
Bill Jane John Tony
1 2 1 1
> layout(matrix(1:4, ncol = 2))
> plot(table(x$Name), main = "plot method for class \"table\"")
> barplot(table(x$Name), main = "barplot")
> tab <- as.numeric(table(x$Name))
> names(tab) <- names(table(x$Name))
> dotchart(tab, main = "dotchart or dotplot")
> ## or just this
> ## dotchart(table(dat))
> ## and ignore the warning
> layout(1)

Related

Fid sample size based on num of rows in data

I have a dataset that looks like this:
Region
Name
Region 1
Name 14
Region 2
Name 18
Region 2
Name 2
Region 2
Name 21
Region 2
Name 44
Region 3
Name 64
Region 3
Name 24
Region 4
Name 1
Region 4
Name 1
Region 4
Name 98
Region 5
Name 98
Region 5
Name 8
Region 5
Name 8
Region 5
Name 8
Region 5
Name 98
I need to breakup the data by Region, and then select a random sample of only 5% of the "Name" per Region, based on the number of rows in Region.
So lets say there are 30 Name in Region 2, then i need a random sample of 3*.05. If there are 50 Name in Region 6, then i need a random sample of 5*.05.
So far, ive been able to split() the data using
d = split(data, f = data$Region)
but when i try to run an lapply function i get an error that there are different number of rows in the list that split() provided
lapply(data, function(x) {
sample_n(data, nrow(d)*.05)
} )
Any thoughts?
Thank you
Here's a base R solution.
lapply(split(data, data$Region),
\(x) x[sample(nrow(x), nrow(x) * 0.05),])
You can then convert it back into a data frame with rbind

R sampling with if statement and similar number of sample

I need to to create a sample from my dataframe and to do so I am using the code bellow.
name <- sample(c("Adam","John","Henry","Mike"),100,rep = TRUE)
area <- sample(c("run","develop","test"),100,rep = TRUE)
id <- sample(100:200,100,rep = FALSE)
mydata <- as.data.frame(cbind(id,area,name))
qcsample <- mydata %>%
group_by(area) %>%
nest() %>%
mutate(n = c(20, 15, 15)) %>%
mutate(samp = map2(data, n, sample_n)) %>%
select(area, samp) %>%
unnest()
Now, I am getting these results.
table(qcsample$area)
develop run test
15 15 20
--
table(qcsample$name)
Adam Henry John Mike
9 9 16 16
I would like to create a sample that would have more or less the same number of samples for each name eg. Adam - 12, Henry - 12, John - 13, Mike - 13.
How can I achieve that ? can I somehow request that the sample is equally distributed ?
Also, in this example I used function
sample_n
and specified number of samples.
I am anticipating that sometimes there will not be required number from a given group. In my example I am taking 20 samples from area called "test" but sometimes there will be only let's say 10 rows containing "test". The total number is 50 so I need to make sure if there are only 10 "test" the code has to automatically increase the others, so the sample would be "test" - 10, "run" - 20 and "develop" - 20. This can happen to any of the area so I need to test if there is enough rows to create the sample and increase other areas. If there is only 1 it can be added to any of the remaining areas or if the difference is 3 we add 1 to one area and 2 to the another one.
How could I check that taking into account all the possibilities ? I believe there are eight permutations in this case.
Thanks in advance A.
If you are using made up data then you can create a minimum amount of each row and then create filler to get you up to the total:
set.seed(42)
names <- c("Adam", "John", "Henry", "Mike")
areas <- c("run", "develop", "test")
totalrows <- 100
minname <- 22 # No less than 20 of each name (set to near threshold to test)
minarea <- 30 # No less than 30 of each area (less randomness the higher these are)
qcsample <- data.frame(
name=sample(c(rep(names, minname), sample(names, totalrows-length(names)*minname, replace=T))),
area=sample(c(rep(areas, minarea), sample(areas, totalrows-length(areas)*minarea, replace=T))),
id=sample(99+(1:totalrows))
)
This results in:
R> table(qcsample$name)
Adam Henry John Mike
23 28 24 25
R> table(qcsample$area)
develop run test
37 31 32
Notice that the count of name to area isn't constrained:
R> table(qcsample[,-3])
area
name develop run test
Adam 5 11 7
Henry 11 8 9
John 10 7 7
Mike 11 5 9
R>
Using a loop as suggested by #r2evans:
library(dplyr)
set.seed(42)
mydata <- data.frame(
name = sample(c("Adam","John","Henry","Mike"), 100, rep = TRUE),
area = sample(c("run","develop","test"), 100, rep = TRUE),
id = sample(100:200, 100, rep = FALSE)
)
Nsamples <- 50
mysample <- data.frame(sample_n(mydata, Nsamples))
minname <- 11 # max is 50/4 -> 12
minarea <- 15 # max is 50/3 -> 16
# the test you were asking about
while( (min(table(mysample$name)) < minname) || (min(table(mysample$area)) < minarea) ) {
mysample <- data.frame(sample_n(mydata, Nsamples))
}
This results in:
R> table(mysample$name)
Adam Henry John Mike
13 15 11 11
R> table(mysample$area)
develop run test
15 17 18
And, like before, there's no minimum of name to area.
R> table(mysample[-3])
area
name develop run test
Adam 4 3 6
Henry 2 6 7
John 4 4 3
Mike 5 4 2
If you needed to enforce an minimum number for each permutation add this to the test:
while(... || (min(table(mysample[-3])) < some_min)) {
BTW, the number of permutations, as you can see by the table, is the number of names times the number of areas.
Here's another thought.
Depending on your desired end-size, it might over-create the number of samples so that it can reduce some name/area pairs to bring the total down.
Let's say that you want to end up with a total of 50 rows:
final_size <- 50
For completeness, here are the sets from which we'll choose:
avail_names <- c("Adam", "John", "Henry", "Mike")
avail_areas <- c("run", "develop", "test")
and the minimum we need to create for Adam,run (etc) in order to certainly end up with no less than final_size rows:
size_per_namearea <- ceiling(final_size / (length(avail_names) * length(avail_areas)))
Ok, generate at least as many (likely more than) the number of rows we need:
set.seed(20180920)
qcsample <- crossing(data_frame(rownum = seq_len(size_per_namearea)),
data_frame(name = avail_names),
data_frame(area = avail_areas)) %>%
group_by(name, area) %>%
mutate(id = sample(100, size = n(), replace = FALSE))
qcsample
# # A tibble: 60 x 4
# # Groups: name, area [12]
# rownum name area id
# <int> <chr> <chr> <int>
# 1 1 Adam run 59
# 2 1 Adam develop 51
# 3 1 Adam test 23
# 4 1 John run 71
# 5 1 John develop 5
# 6 1 John test 24
# 7 1 Henry run 4
# 8 1 Henry develop 29
# 9 1 Henry test 79
# 10 1 Mike run 77
# # ... with 50 more rows
Verify we have identical sample sizes for each name/area:
xtabs(~ name + area, data = qcsample) %>%
stats::addmargins()
# area
# name develop run test Sum
# Adam 5 5 5 15
# Henry 5 5 5 15
# John 5 5 5 15
# Mike 5 5 5 15
# Sum 20 20 20 60
If we just do head(final_size), then we know which names we will be cutting short, which undermines the randomness of your sampling a little. The reason I added rownum up front was so that I can arrange by it plus a jitter, ensuring I get all of max(rownum)-1, and then some sampling of max(rownum), guaranteeing that each name/area pair have either max(rownum)-1 or max(rownum) rows; your tallies are never different by more than 1.
reducedsample <- arrange(qcsample, rownum + runif(n())) %>%
head(final_size) %>%
select(-rownum)
reducedsample %>%
xtabs(~ name + area, data = .) %>%
stats::addmargins()
# area
# name develop run test Sum
# Adam 4 4 5 13
# Henry 5 4 4 13
# John 4 4 4 12
# Mike 4 4 4 12
# Sum 17 16 17 50

Check if a variable is time invariant in R

I tried to search an answer to my question but I find the right answer for Stata (I am using R).
I am using a national survey to study which variables influence the investment in complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than one time. I filtered the df in order to have only the individuals present more than one time trought the filter command. This is an example from the original survey already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression so I load the plm package and then I set the df like this:
df.p <- plm.data(df, c("id", "year")
After this command, I expected that constant variables were deleted but after running this regression:
pan1 <- plm (pens ~ woman + age + I(age^2) + high + medium + north + centre, model="within", effect = "individual", data=dd.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high, medium refer to education level and north, centre to geographical regions) and after the command summary(pan1) the variable woman is still present.
At this point I think that there are some mistakes in the survey (for example sex was not insert correctly and so it wasn't the same for the same id), so I tried to find a way to check if for each id, sex is constant.
I tried this code but I am sure it is not correct:
df$x <- ifelse(df$id==df$id & df$sex==df$sex,1,0)
the basic idea shuold be like this:
df$x <- ifelse(df$id=="1" & df$sex=="F",1,0)
but I can't do it manually since the df is composed up to 40k observations.
If you know another way to check if a variable is constant in R I will be glad.
Thank you in advance
I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1, but any cases of 2 indicate a transcription error. The way to do this in R is
any(by(df$sex,df$id,function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
We can try with dplyr
Example data:
df=data.frame(year=c(2002,2002,2004,2004,2006,2008,2008,2010),
id=c(1,2,1,2,3,3,4,4),
sex=c("F","M","M","M","M","M","F","F"))
Id 1 is both F and M
library(dplyr)
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter:
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))%>%filter(sexes==2)
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2

Counting Occurences & Histogram Charts in R

Imagine you have a data frame with 2 variables - Name & Age. Name is of class factor and Age number. Now imagine now there are thousands of people in this data frame. How do you:
Produce a table with: NAME | COUNT(NAME) for each name uniquely?
Produce a histogram where you can change the minimum number of
occurrences to show up in the histogram.?
For part 2, I want to be able to test different minimum frequency values and see how the histogram comes out. Or is there a better method pragmatically to determine the minimum count for each name to enter the histogram?
Thanks!
Edit: Here is what the table would look like in a RDBS:
NAME | COUNT(NAME)
John | 10
Bill | 24
Jane | 12
Tony | 50
Emanuel| 1
...
What I want to be able to do is create a function to graph a histogram, where I can change a value that sets the minimum frequency to be graphed. Make more sense?
> x <- read.table(textConnection('
+ Name Age Gender Presents Behaviour
+ 1 John 9 male 25 naughty
+ 2 Bill 5 male 20 nice
+ 3 Jane 4 female 30 nice
+ 4 Jane 4 female 20 naughty
+ 5 Tony 4 male 34 naughty'
+ ), header=TRUE)
>
> table(x$Name)
Bill Jane John Tony
1 2 1 1
> layout(matrix(1:4, ncol = 2))
> plot(table(x$Name), main = "plot method for class \"table\"")
> barplot(table(x$Name), main = "barplot")
> tab <- as.numeric(table(x$Name))
> names(tab) <- names(table(x$Name))
> dotchart(tab, main = "dotchart or dotplot")
> ## or just this
> ## dotchart(table(dat))
> ## and ignore the warning
> layout(1)

R "melt-cast" like operation

I have a file contains contents like this:
name: erik
age: 7
score: 10
name: stan
age:8
score: 11
name: kyle
age: 9
score: 20
...
As you can see, each record actually contains 3 rows in the file. I am wondering how can I read in the file and transform into data dataframe looks like below:
name age score
erik 7 10
stan 8 11
kyle 9 20
...
What I have done so far(thanks tcash21):
> data <- read.table(file.choose(), header=FALSE, sep=":", col.names=c("variable", "value"))
> data
variable value
1 name erik
2 age 7
3 score 10
4 name stan
5 age 8
6 score 11
7 name kyle
8 age 9
9 score 20
I am thinking how can I split the column into two columns by : and then maybe use something similar like cast in reshape package to do what I want?
or how can I get the rows that has index number 1,4,7,... only, which has a constant step
Thanks!
Another possibility:
library(reshape2)
df$id <- rep(1:(nrow(df)/3), each = 3)
dcast(df, id ~ variable, value.var = "value")
# id age name score
# 1 1 7 erik 10
# 2 2 8 stan 11
# 3 3 9 kyle 20
If the format is predictable you might want to do something really simple like
# recreate data
data <- as.matrix(c("erik",7,10,"stan",8, 11,"kyle",9,20),ncol=1)
# get individual variables
names <- data[seq(1,length(data)-2,3)]
age <- data[seq(2,length(data)-1,3)]
score <- data[seq(3,length(data),3)]
# combine variables
reformatted.data <- as.data.frame(cbind(names,age,score))

Resources