Plotting a scatter plot with wide-range data in R

I uploaded a CSV file to RStudio and am trying to plot two columns. The first holds the number of likes and the second the number of shares. I want to show how the number of shares relates to the number of likes a post gets.
The problem is that the likes count ranges from 1 to 1 million, while the shares count ranges from 5 to 37,000.
Sample of my dataset (both columns are of class factor):
topMedia$likes_count
[1] 61 120 271 140 59 498 241 117 124 124 225 117 186 101
[15] 118 134 152 136 153 124 100 77 98 77 88 48 58 66
topMedia$shares_count
[1] 12 171 NULL 23 34 108 430 NULL NULL NULL 283 NULL NULL 57
[15] NULL NULL NULL 68 105 NULL NULL 7 10 45 103 22 75 16
When I use this code to draw a scatter plot, it looks messy:
plot(as.numeric(topMedia$shares_count),as.numeric(topMedia$likes_count))
I tried using other libraries
library(hexbin)
cols = colorRampPalette(c("#fee6ce", "#fd8d3c", "#e6550d", "#a63603"))
plot(hexbin(as.numeric(topMedia$shares_count), as.numeric(topMedia$likes_count), xbins = 40), colorcut = seq(0,1,length=20),
colramp = function(n) cols(20), legend = FALSE,xlab = 'share count', ylab = 'like count')
but I get a similar result, even with colours.
What would be a better way to show the relationship between those values?
Thanks.

In this case, the even-ish distribution (for what should be a clear positive correlation between "likes" and "shares") is a clue that the numeric data might have been inadvertently loaded as a factor. Another clue is that the x and y values only span the number of unique values, not the range of the underlying numeric data. Calling as.numeric() directly on a factor returns its internal integer codes, not the numbers spelled out in its levels. To see the intended numbers, we need to convert the levels of the factor instead, with something like as.numeric(as.character(x)).
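As a minimal sketch of that difference, assuming a small made-up vector of like counts (the name likes and its values are illustrative, not taken from the asker's file):

```r
# Toy factor built from strings, mimicking a column that was read in as text
likes <- factor(c("61", "120", "271", "59"))

# as.numeric() on a factor returns the integer codes,
# i.e. each value's position among the alphabetically sorted levels
codes <- as.numeric(likes)

# Converting to character first recovers the numbers that were typed in
values <- as.numeric(as.character(likes))

codes
values
```

Here the levels sort as "120", "271", "59", "61", so the codes bear no relation to the size of the numbers.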
To give an example, suppose we had some linearly correlated data like this:
library(ggplot2); library(dplyr)
set.seed(42)
fake_data <- data.frame(x = runif(10000, 0, 1000000))
fake_data$y <- pmax(0, fake_data$x*rnorm(10000, 1, 2) + runif(10000, 0, 1000000))
ggplot(fake_data, aes(x,y)) + geom_point()
If that numeric data were loaded in as factors, it might look more like this, not too dissimilar from the data in this question. This is easy to do with read.csv when stringsAsFactors = FALSE isn't set (before R 4.0.0 the default was TRUE). The data here is being read as if it were character data, and then made into a factor whose levels are ordered alphabetically, with "10000" before "2" because "1" comes before "2".
fake_data_factor <- fake_data %>%
mutate(x = as.factor(as.character(x)),
y = as.factor(as.character(y)))
The x and y columns now store integer codes that follow the alphabetical order of their levels, not the numbers those levels spell out. R uses the codes to sort or to plot, so the x values with the lowest codes in the new data have levels near 100,000 instead of near 0. In the table below, 100,124 in row 1 comes alphabetically earlier than 10,058 in row 8!
fake_data_factor %>%
arrange(x) %>%
head(8)
# x y
#1 100124.688120559 0
#2 100229.354342446 289241.187250382
#3 100299.560697749 232233.101769741
#4 100354.233058169 814492.563551191
#5 100364.253856242 1183870.56252858
#6 100370.0227011 1224652.83777805
#7 100461.616180837 1507465.73704898
#8 10058.1261795014 604477.823016668
ggplot(fake_data_factor, aes(as.numeric(x),as.numeric(y))) +
geom_point()
We can get back to the intended numbers by converting the factors to character (which extracts each one's level) and then converting those to numeric.
fake_data_factor %>%
ggplot(aes(as.numeric(as.character(x)),as.numeric(as.character(y)))) +
geom_point()
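For the data in the question, the NULL entries are a likely reason the columns were read as text rather than numbers. A hedged sketch, assuming the file uses the literal text NULL for missing values (csv_text below is a made-up stand-in for the real file), is to declare that marker when reading and then plot on log axes so the wide ranges stay readable:

```r
# Hypothetical CSV content standing in for the asker's file
csv_text <- "likes_count,shares_count
61,12
120,171
271,NULL
1000000,37000"

# na.strings = "NULL" turns the text NULL into NA, so both columns stay numeric
topMedia <- read.csv(text = csv_text, na.strings = "NULL")

# log = "xy" puts both axes on a log scale; points with NA are skipped
plot(topMedia$shares_count, topMedia$likes_count, log = "xy",
     xlab = "share count", ylab = "like count")
```

With a real file, the text argument would be replaced by the file path. Log axes only work when all plotted values are positive, which matches the ranges described in the question.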


Calculate Multiple Information Value in R

I am new to R programming and learning part time, so I apologize in advance for naive code and questions. I have spent about a day trying to figure this out without success, hence asking here.
https://www.kaggle.com/c/titanic/data?select=train.csv
I am working on the Titanic training data set from Kaggle, imported as train_data. I have cleaned up all the columns and converted them to factors where needed.
My question has several parts:
1. I cannot understand why this code gives IV values of 0 for everything. What have I done wrong?
factor_vars <- colnames(train_data)
all_iv <- data.frame(VARS=factor_vars, IV=numeric(length(factor_vars)),STRENGTH=character(length(factor_vars)),stringsAsFactors = F)
for (factor_var in factor_vars){
all_iv[all_iv$VARS == factor_var, "IV"] <-
InformationValue::IV(X=train_data[, factor_var], Y=train_data$Survived)
all_iv[all_iv$VARS == factor_var, "STRENGTH"] <-
attr(InformationValue::IV(X=train_data[, factor_var], Y=train_data$Survived), "howgood")
}
all_iv <- all_iv[order(-all_iv$IV), ]
2. I am trying to write my own function to calculate IV values for multiple columns in one go, so that I do not have to repeat the task. However, when I run the following code I get the total counts of 0 and 1 instead of counts grouped as I requested. Again, what am I doing wrong in this example?
train_data %>% group_by(train_data[[3]]) %>%
summarise(zero = sum(train_data[[2]]==0),
one = sum(train_data[[2]]==1))
I get output
zero one
1 549 342
2 549 342
3 549 342
whereas I would anticipate an answer like:
zero one
1 80 136
2 97 87
3 372 119
What is wrong with my code?
3. Is there any pre-built function that gives IV values for all columns? On searching I found the iv.mult function, but I cannot get it to work. Any suggestion would be great.
Let's take a look at your questions:
1.
length(factor_vars)
#> [1] 12
length() returns the number of elements of your vector factor_vars. So your code numeric(length(factor_vars)) evaluates to numeric(12), which returns a numeric vector of length 12, filled with zeros by default.
The same applies to character(length(factor_vars)), which returns a character vector of length 12 filled with empty strings "".
2. Your code doesn't use correct dplyr syntax. Inside group_by() and summarise(), refer to the columns by their bare names:
library(dplyr)
train_data %>%
group_by(Pclass) %>%
summarise(zero = sum(Survived == 0),
one = sum(Survived == 1))
returns
# A tibble: 3 x 3
Pclass zero one
<dbl> <int> <int>
1 1 80 136
2 2 97 87
3 3 372 119
which is most likely what you are looking for.
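The original attempt returned 549 and 342 in every row because train_data[[2]] inside summarise() pulls the whole column from the data frame and so ignores the grouping. A base R sketch of the same contrast, on a made-up stand-in for the Titanic data (the eight rows below are invented for illustration):

```r
# Toy stand-in for the Titanic training data (values are made up)
train_data <- data.frame(
  Survived = c(0, 1, 1, 0, 0, 1, 0, 0),
  Pclass   = c(1, 1, 1, 2, 2, 3, 3, 3)
)

# Whole-column sums: the same two numbers no matter which class is asked for
whole_zero <- sum(train_data$Survived == 0)
whole_one  <- sum(train_data$Survived == 1)

# Per-class counts: table() cross-tabulates class against survival
per_class <- table(Pclass = train_data$Pclass, Survived = train_data$Survived)
per_class
```

The rows of per_class correspond to the zero and one columns of the grouped dplyr result.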
3. I don't know the meaning of IV.

R-Studio Filtering Data

I have this data table as model:
ID PRODUCT_TYPE OFFER INVENTORY
1 BED Y Y
2 TABLE N Y
3 MOUSE Y N
4 CELLPHONE Y Y
5 CAR Y Y
6 BED N N
7 TABLE N Y
8 MOUSE Y N
9 CELLPHONE Y Y
10 CAR Y Y
.....
I have to extract a sample of 50% of the total population, and every value of each variable must appear in the sample at least once (product_type == bed, cellphone, car, table, mouse; offer == Y, N; etc.).
I used this to extract the sample:
subset1<- data2 %>% sample_frac(.5)
but I don't know how to integrate these conditions. Can anyone give me some advice?
The original post does not state it outright, but the question appears to be: how does one generate a stratified random sample based on combinations of a set of grouping variables? A stratified random sample is an appropriate approach in this situation because it ensures that each combination of grouping variables is proportionally represented in the sampled data frame.
A tidyverse solution
Since the question does not include a minimal reproducible example, we'll generate some data and illustrate how to split or group it and then randomly sample each of the subgroups.
To begin, we set the seed for the random number generator and build a data frame containing 10,000 rows of products, where 50% of the products are on offer, and 70% are in inventory.
set.seed(1053807)
df <- data.frame(
productType = rep(c("Bed","Mouse","Table","Cellphone","Laptop","Car","Chair","Blanket",
"Sofa","Bicycle"),1000),
offer = ifelse(runif(10000) > .5,"Y","N"),
inventory = ifelse(runif(10000) > .3,"Y","N"),
price = rnorm(10000,200,10)
)
Given the three grouping variables in the original post, the df object contains 40 unique combinations of productType, offer, and inventory.
The original code attempts to use the dplyr package to sample the data. It was very close to a workable solution. To stratify the sample we use group_by() to group the data by split variables, and then use the sample_frac() function on the grouped data to generate the stratified sample.
library(dplyr)
df %>%
group_by(productType,offer,inventory) %>%
sample_frac(0.5) -> sampledData
Verifying results
A 50% sample from a 10,000 row data frame should have about 5,000 observations.
> nrow(sampledData)
[1] 5001
So far, so good.
We can then verify the results by counting numbers of rows in each stratum of the sample, and comparing them to the original counts for each subgroup in the input data frame.
# check results
originalCounts <- df %>%
group_by(productType,offer,inventory) %>%
summarise(OriginalCount = n())
sampledData %>%
group_by(productType,offer,inventory) %>%
summarise(SampledCount = n()) %>%
full_join(originalCounts,.) %>%
mutate(SampledPct = round(SampledCount / OriginalCount * 100,2))
...and the output:
# A tibble: 40 x 6
# Groups: productType, offer [20]
productType offer inventory OriginalCount SampledCount SampledPct
<chr> <chr> <chr> <int> <int> <dbl>
1 Bed N N 161 80 49.7
2 Bed N Y 371 186 50.1
3 Bed Y N 132 66 50
4 Bed Y Y 336 168 50
5 Bicycle N N 154 77 50
6 Bicycle N Y 349 174 49.9
7 Bicycle Y N 147 74 50.3
8 Bicycle Y Y 350 175 50
9 Blanket N N 134 67 50
10 Blanket N Y 349 174 49.9
# … with 30 more rows
By inspecting the data, we see that strata with even numbers of observations result in an exact 50% sample, whereas strata with odd numbers of observations are slightly above or below 50%.
A Base R solution
We can also solve the problem with Base R. This approach uses the three variables in the original post, product type, offer, and inventory to split the data into subgroups based on the combinations of values for these variables, take a random sample from each subset, and combine the result into a single data frame.
First, we set the seed for the random number generator and build a data frame containing 10,000 rows of products, where 50% of the products are on offer, and 70% are in inventory.
set.seed(1053807)
df <- data.frame(
productType = rep(c("Bed","Mouse","Table","Cellphone","Laptop","Car","Chair","Blanket",
"Sofa","Bicycle"),1000),
offer = ifelse(runif(10000) > .5,"Y","N"),
inventory = ifelse(runif(10000) > .3,"Y","N"),
price = rnorm(10000,200,10)
)
Since we want to separately sample each combination of product, offer, and inventory, we create a combined split variable, and then use it to split the data.
splitvar <- paste(df$productType,df$offer,df$inventory,sep="-")
dfList <- split(df,splitvar)
Given the input data frame parameters of 10 products, 2 levels of offer (Y / N), and 2 levels of inventory (Y / N), this creates a dfList object that is a list of 40 data frames, each with varying numbers of observations.
We then use lapply() to randomly select about 50% of each data frame, using the number of rows for each data frame to drive the sample() function.
sampledDataList <- lapply(dfList,function(x){
x[sample(nrow(x),size = round(.5 * nrow(x))),]
})
At this point the sampledDataList object is a list of 40 data frames, each of which has approximately 50% of the rows of the corresponding data frame in dfList.
To create the final data frame, we use do.call() as follows.
sampledData <- do.call(rbind,sampledDataList)
When we check the number of observations in the resulting data frame, we see that it is approximately 50% of the original data size (10,000).
> # this should be approximately 5,000 rows
> nrow(sampledData)
[1] 5001
We can further verify that each data frame is approximately a 50% sample with the following code.
# verify sample percentage by stratum
stratum <- names(sampledDataList)
OriginalCount <- sapply(dfList,nrow)
SampledCount <- sapply(sampledDataList,nrow)
SamplePct <- round(SampledCount / OriginalCount * 100,2)
head(data.frame(stratum,OriginalCount,SampledCount,SamplePct,row.names = NULL),10)
...and the output:
> head(data.frame(stratum,OriginalCount,SampledCount,SamplePct,row.names = NULL),10)
stratum OriginalCount SampledCount SamplePct
1 Bed-N-N 161 80 49.69
2 Bed-N-Y 371 186 50.13
3 Bed-Y-N 132 66 50.00
4 Bed-Y-Y 336 168 50.00
5 Bicycle-N-N 154 77 50.00
6 Bicycle-N-Y 349 174 49.86
7 Bicycle-Y-N 147 74 50.34
8 Bicycle-Y-Y 350 175 50.00
9 Blanket-N-N 134 67 50.00
10 Blanket-N-Y 349 174 49.86
As was the case with the dplyr solution, we see that strata with odd numbers of rows either sample one more or one less than an exact 50% of the original data.
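One edge case is worth a sketch: round(.5 * nrow(x)) evaluates to 0 for a stratum with a single row, so that stratum would vanish from the sample, which conflicts with the requirement that every value appear at least once. The following variant, on a small made-up data frame, assumes that a floor of one row per stratum is acceptable:

```r
set.seed(42)

# Made-up data; the Car-Y stratum has only one row on purpose
df <- data.frame(
  productType = c(rep("Bed", 6), rep("Table", 5), "Car"),
  offer       = c(rep(c("Y", "N"), 5), "Y", "Y")
)

# Split into strata by the combination of product type and offer
splitvar <- paste(df$productType, df$offer, sep = "-")
dfList <- split(df, splitvar)

# Take about 50% of each stratum, but never fewer than one row
sampledDataList <- lapply(dfList, function(x) {
  n <- max(1, round(.5 * nrow(x)))
  x[sample(nrow(x), size = n), , drop = FALSE]
})
sampledData <- do.call(rbind, sampledDataList)

# TRUE when every stratum in df is still present in the sample
sampledStrata <- paste(sampledData$productType, sampledData$offer, sep = "-")
all(unique(splitvar) %in% sampledStrata)
```

The price of the floor is that very small strata end up sampled at more than 50%.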

SPSS value labels as column names for tables in R?

I'm reading a .sav file using haven:
library(haven)
data <- read_spss("file.sav", user_na = FALSE)
Then trying to display one of the variables in a table:
table(data$region)
Which returns:
1 2 3 4 5 6 7 8 9 10 11 12
85 208 43 171 30 40 95 310 133 29 77 36
Which is technically correct, however - in SPSS, the numerical values in the top row have labels associated with them (region names in this case). If I just run data$region, it shows me the numbers and their associated labels at the end of the output, but is there a way to make those string labels appear in the first table row instead of their numerical counterparts?
Thank you in advance for your help!
The way to do this is to cast the variable as a factor, using the "labels" attribute of the vector as the factor levels. The sjlabelled package includes a function that does this in one step:
data$region <- sjlabelled::as_label(data$region)
While the table command will still work on the resulting data, the layout may be a little messy. The forcats package has a function that pretty-prints frequency tables for factors:
data$region %>% forcats::fct_count()
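To illustrate what that conversion does, here is a base R sketch on a made-up vector that imitates a labelled SPSS variable (the region names and codes are invented; a real haven column carries its labels in the same "labels" attribute):

```r
# Toy vector imitating a labelled variable: numeric codes plus a named "labels" attribute
region <- c(1, 2, 2, 3, 1, 2)
attr(region, "labels") <- c(North = 1, South = 2, West = 3)

# Use the label values as factor levels and the label names as the displayed text
labs <- attr(region, "labels")
region_f <- factor(region, levels = unname(labs), labels = names(labs))

# The table header now shows the names instead of the codes
table(region_f)
```

The haven package also ships its own converter, as_factor(), which may be enough if adding sjlabelled is not desirable.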

Converting contingency tables with counts to two-column data tables with frequency columns

I would like to enter a frequency table into an R data.table.
The data are in a format like this:
Height
Gender 3 35
m 173 125
f 323 198
... where the entries in the table (173, 125, etc.) are counts.
I have a 2 by 2 table, and I want to turn it into two-column data.table.
The data is from a study of birds who nest at a height. The question is whether different genders of the bird prefer certain heights.
I thought the frequency table should be turned into something like this:
Gender height N
m 3 173
m 35 125
f 3 323
f 35 198
but now I'm not so sure. Some of the models I want to run need every case itemized.
Can I do this conversion in R? Ideally, I'd like a way to switch back and forth between the two formats.
Based on a review of ?table.
Make a data frame (x) with columns for Gender, Height, and Freq which would be your N value.
Convert that to a table by using
tabledata <- xtabs(Freq ~ ., x)
There are a number of base functions that can work with this kind of data, which is obviously much more compact than individual rows.
Also from ?loglin this example using table.
loglin(HairEyeColor, list(c(1, 2), c(1, 3), c(2, 3)))
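A minimal sketch of those steps with the counts from the question, assuming the column names Gender, Height, and Freq:

```r
# Four-row counts data frame, one row per Gender/Height combination
x <- data.frame(
  Gender = c("m", "m", "f", "f"),
  Height = c(3, 35, 3, 35),
  Freq   = c(173, 125, 323, 198)
)

# Counts data frame -> 2 x 2 contingency table
tabledata <- xtabs(Freq ~ Gender + Height, data = x)
tabledata

# Contingency table -> counts data frame again
back <- as.data.frame(tabledata)
back
```

The formula lists the two grouping columns explicitly here; Freq ~ . is the shorthand for "all other columns".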
Thanks, everybody (@simon and @Elin) for the help. I thought I was conducting a poll that would get answers like "start with the 4-row version" or "start with the 719-row version" and you all have given me an entire toolbox of ways to move from one to the other. It's really great, informative, and way more than the inquiry deserves.
I unquestionably need to work harder and get more explicit in forming a question. I see by the -3 rating that this boondoggle has earned, crystallizing the fact that I'm not adding anything to the knowledge base, so will delete the question in order to keep future searchers from finding this. I've had a bad run recently with my questions, and as a former teacher of the year, writer of five books, and PhD statistician, it's extremely embarrassing to have been on Stack Exchange for as long as I have, and stand here with one reputation point. One. That means that my upvotes of your answers don't count for a thing.
That reputation point should be scarlet colored.
Here's what I was getting at:
In a book, a common way to express data is in a 2×2 table:
Height
Gender 3 35
M 173 125
F 323 198
My tic-tac-sized mind sees two ways of entering that into a data table:
require(data.table)
GENDER <- c("m","m","f","f")
HEIGHT <- c(3, 35, 3, 35)
N <- c(173, 125, 323, 198)
SANDFLIERS <-data.table(GENDER, HEIGHT, N)
That gives the four-line flat-file/tidy representation of the data:
GENDER HEIGHT N
1: m 3 173
2: m 35 125
3: f 3 323
4: f 35 198
The other option is to make a 719-row data table with 173 male@3ft, 125 male@35ft, etc. It's not too bad if you use the rep() command and build your table columns carefully. I hate doing arithmetic, so I leave some of these numbers bare and untotaled.
# I need 173+125 males, and 323+198 females.
# One c(rep()) for "m", one c(rep() for "f", and one c() to merge them
gender <- c(c(rep("m", 173+25)), c(rep("f",(323+198))))
# Same here, except the c() functions are one level 'deeper'. I need two
# sets for males (at heights 3 and 35, 173 and 125 of each, respectively)
# and two sets for females (at heights 3 and 35, 323 and 198 respectively)
heights <-c(c(c(rep(3, 173)), c(rep(35,25))), c(c(rep(3, 323)), c(rep(35,198))))
which, when merged into a data.table gives 719 rows, one for each observed bird.
1: m 3
2: m 3
3: m 3
4: m 3
5: m 3
---
715: f 35
716: f 35
717: f 35
718: f 35
719: f 35
Now that I have the data in two formats, I start looking for ways to do plots and analyses.
I can get a mosaic plot using the 719-row version, but you can't see it because of my 1-point reputation
mosaicplot(table(sandfliers), color = TRUE)
Mosaic Plot
and you can get a balloon plot using the 4-row version
Balloon Plot
So my question was, for those of you with lots and lots of experience with this sort of thing, do you find the 4-row or the 719-row tables more common? I can change from one to the other, but that's more code to add to the book (again I hear my editor, "You're teaching statistics, not R").
So, as I said at the top, this was just an informal poll on whether one is used more often than the other, or whether beginners are better off with one.
This is in the form of a contingency table. It isn't easy to enter directly into R but it can be done as follows (based on http://cyclismo.org/tutorial/R/tables.html):
> f <- matrix(c(173,125,323,198),nrow=2,byrow=TRUE)
> colnames(f) <- c(3,35)
> rownames(f) <- c("m","f")
> f <- as.table(f)
> f
3 35
m 173 125
f 323 198
You can then create a count or frequency table with:
> as.data.frame(f)
Var1 Var2 Freq
1 m 3 173
2 f 3 323
3 m 35 125
4 f 35 198
The Cookbook for R gives a short function to convert to a table of cases (i.e. a long list of the individual items), as follows:
> countsToCases(as.data.frame(f))
... where:
# Convert from data frame of counts to data frame of cases.
# `countcol` is the name of the column containing the counts
countsToCases <- function(x, countcol = "Freq") {
# Get the row indices to pull from x
idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
# Drop count column
x[[countcol]] <- NULL
# Get the rows from x
x[idx, ]
}
... thus you can convert the data to the format needed by any analysis method from any starting format.
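For the return trip, from one row per case back to counts, a sketch along these lines should work; the object names counts, cases, and recount are illustrative:

```r
# Counts data frame with the numbers from the contingency table above
counts <- data.frame(
  Gender = c("m", "m", "f", "f"),
  Height = c(3, 35, 3, 35),
  Freq   = c(173, 125, 323, 198)
)

# Counts -> cases: repeat each row index Freq times (same idea as countsToCases)
cases <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Gender", "Height")]

# Cases -> counts: cross-tabulate, then flatten the table to a data frame
recount <- as.data.frame(table(cases))
recount
```

Note that table() turns Height into a factor, so it would need converting back if numeric heights are required.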
(EDIT)
Another way to read in the contingency table is to start with text like this:
> ss <- " 3 35
+ m 173 125
+ f 323 198"
> read.table(text=ss,row.names=1)
X3 X35
m 173 125
f 323 198
Instead of using text =, you can also pass a file name to read the table from a file (for a CSV file, also set sep = ",").

setting variable value by subsetting

This is my first question, so please bear with me.
I am creating a new variable age.f.sex in my dataframe wm.13 using an already existing variable SB1. In the original dataframe, SB1 indicates the age of first sexual intercourse of women interviewed in UNICEF's Multiple Indicators Cluster Surveys. The values that SB1 can take are:
> sort(unique(wm.13$SB1))
[1] 0 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[26] 30 31 32 33 34 35 36 37 38 39 40 41 42 44 48 95 97 99
Here is the meaning of the values SB1 can take
0 means she never had sex
97 and 99 mean "does not remember/does not know"
95 means that she had her first sexual intercourse when she started living with her husband/partner (for which there is a specific variable, i.e MA9)
Any number between 0 and 95 is the declared age at their first sexual intercourse
There are also NAs that sort() does not show, but they appear if I just use unique()
I created a new variable from SB1, which I called age.f.sex.
wm.13$age.f.sex <- wm.13$SB1
I had the 0, 97 and 99 values replaced with NAs, and I kept the original NAs in SB1. I did this using the following code:
wm.13$age.f.sex[wm.13$SB1 == 0] <- NA
wm.13$age.f.sex[wm.13$SB1 == 97] <- NA
wm.13$age.f.sex[wm.13$SB1 == 99] <- NA
wm.13$age.f.sex[is.na(wm.13$SB1)] <- NA
Everything worked fine until here. However, I am in trouble with the 95 value. I want to code so that the observations that have value 95 in SB1 (i.e. the age of first sexual intercourse) will have the value from MA9 (i.e. the age when the woman started living with her partner/husband) in my new variable age.f.sex.
I first started with this code
> wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9
but I got the following error message
Error in wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9 :
NAs are not allowed in subscripted assignments
After some research on this website, I realised that I might need to subset the right-hand side of the code too, but honestly I do not know how to do it. I have a feeling that which() or ifelse() might be of use here, but I cannot figure out their arguments. Examples I have found on this website show how to impute one specific value, but I could not find anything on subsetting according to the value the observations take in another variable.
I hope I have been clear enough. Any suggestion will be much appreciated.
Thanks, Manolo
Perhaps you could try:
wm.13$age.f.sex <- ifelse(wm.13$SB1 %in% c(0,97,99) | is.na(wm.13$SB1), NA, ifelse(wm.13$SB1 == 95, wm.13$MA9, wm.13$SB1))
In short, it works like this: The code checks whether wm.13$SB1 is 0, 97, 99 or missing, and then returns NA. Subsequently, it checks whether wm.13$SB1 is 95, and if so, it returns the value on that row in the MA9 column. In all other cases it returns the SB1 value. Because of "wm.13$age.f.sex <-" at the beginning of the line the return values are assigned to your new age.f.sex variable.
As the error message indicates, subscripted assignment is not possible when the filter contains NAs and the replacement holds more than one value (which is why the earlier single-value NA assignments worked). A way to circumvent this is to explicitly include NA as a factor level. The following example illustrates a possible way to replace 95s by their corresponding value in a second column.
# example dataframe
df <- data.frame(a = c(NA, 3, 95, NA),
b = 1:4)
# set a to factor with NA as one of the levels (besides those in a and b)
df$a <- factor(df$a, levels = union(df$a, df$b), exclude = NULL)
# subscripted assignment (don't forget to filter b too!)
df$a[df$a == 95] <- df$b[df$a == 95]
# restore to numeric
df$a <- as.numeric(as.character(df$a))
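Another option, closer to the which() idea mentioned in the question, is to build an index that already excludes the NAs and use it on both sides of the assignment. A sketch on a made-up data frame (wm and its values are illustrative stand-ins for wm.13):

```r
# Toy stand-in for wm.13 with the two relevant columns
wm <- data.frame(
  SB1 = c(0, 17, 95, NA, 99, 95),
  MA9 = c(NA, 20, 19, 22, 25, 21)
)

wm$age.f.sex <- wm$SB1

# %in% never returns NA, so this filter is safe even with missing SB1 values
wm$age.f.sex[wm$SB1 %in% c(0, 97, 99)] <- NA

# which() keeps only the positions where the test is TRUE and drops NAs
idx <- which(wm$SB1 == 95)

# Subset the right-hand side with the same index so the lengths match
wm$age.f.sex[idx] <- wm$MA9[idx]

wm
```

Rows where SB1 is NA keep their NA in age.f.sex because they are never selected by either filter.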
