I have data that looks like this:
+-------------+------------+------------------+-------------------+------------------+
| gender | age | income | ate_string_cheese | tech_familiarity |
+-------------+------------+------------------+-------------------+------------------+
| A. Female | D. 45-54 | B. $50K - $80K | B. Once or twice | A. Low |
| A. Female | C. 35-44 | A. $35K - $49K | B. Once or twice | B. Medium |
| B. Male | B. 25-34 | B. 50k - 79,999 | B. Once or twice | C. High |
| A. Female | A. 18-24 | D. $100k - $149k | B. Once or twice | B. Medium |
+-------------+------------+------------------+-------------------+------------------+
I want to try to find correlations between different observations. I need the values to be numerical. I'm wondering if there's an easy way to do this in R?
To be clear, the result for the data above would look like this:
+--------+-----+--------+-------------------+------------------+
| gender | age | income | ate_string_cheese | tech_familiarity |
+--------+-----+--------+-------------------+------------------+
| 1 | 4 | 2 | 2 | 1 |
| 1 | 3 | 1 | 2 | 2 |
| 2 | 2 | 2 | 2 | 3 |
| 1 | 1 | 4 | 2 | 2 |
+--------+-----+--------+-------------------+------------------+
I'm assuming there must be a package for this, but I can't find the Google incantation that will conjure it. Please know that I'm a complete statistics newbie who's just poking around, so if you prod me for more details, I likely won't have an educated answer to return.
To answer your question about converting categorical data into numerical data in R:
You can convert character data into a factor using as.factor().
factor() returns an object of class "factor", which has a set of integer codes the length of x with a "levels" attribute of mode character.
Pros:
This will encode your data numerically, with an attribute that maps each code back to its character value for reference.
Factors can be ordered, which can capture important information about ordinal data (such as the age bands in your case).
Cons:
Beware of converting categorical data into numeric values for the purposes of statistical analysis. The numerical values are probably not on an interval or ratio scale for all questions, so taking things like the mean or the difference between levels may not make sense. For example, consider whether the distance between each level is actually constant and whether the scale has a natural zero point.
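As a minimal sketch of the factor approach described above, the snippet below encodes the age bands from the question as an ordered factor and reads off the integer codes. The vector name age and the choice to sort the labels alphabetically to get the level order are assumptions made for illustration; it works here only because every label carries an "A.", "B.", ... prefix.

```r
# Age bands copied from the question's example table
age <- c("D. 45-54", "C. 35-44", "B. 25-34", "A. 18-24")

# Ordered factor: level order taken from the alphabetical prefix (assumption)
age_f <- factor(age, levels = sort(unique(age)), ordered = TRUE)

levels(age_f)                  # the character labels kept for reference
age_codes <- as.integer(age_f) # the underlying integer codes
age_codes                      # 4 3 2 1

# Ordered factors support comparisons, but not arithmetic such as mean()
age_f[1] > age_f[2]            # TRUE
```

The codes only say which band comes first; they do not say that the bands are equally far apart.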
You just need to extract the leading letter, convert it to lowercase, and map it to a number:
# Your original data frame
df <- read.table(text="gender;age;income;ate_string_cheese;tech_familiarity
A. Female;D.45-54;B.$50K - $80K;B.Once or twice;A.Low
A. Female;C.35-44;A.$35K - $49K;B.Once or twice;B. Medium
B. Male;B.25-34;B.50k - 79,999;B.Once or twice;C. High
A. Female;A. 18-24;D.$100k - $149k;B.Once or twice;B. Medium",header=TRUE,sep=";")
myLetters <- letters[1:26]
# For every column: keep the leading letter, convert it to lowercase, and look up its position in the alphabet
sapply(df, function(x) match(tolower(gsub("([A-Za-z]+).*", "\\1", x)), myLetters))
Output:
gender age income ate_string_cheese tech_familiarity
[1,] 1 4 2 2 1
[2,] 1 3 1 2 2
[3,] 2 2 2 2 3
[4,] 1 1 4 2 2
You could trim the whitespace, grab just the A, B, C, D prefixes, and call factor on them with levels=LETTERS[1:4] and labels=1:4.
structure(factor(sub('\\..*','',trimws(as.matrix(df))),levels=LETTERS[1:4],labels=1:4),.Dim=dim(df),dimnames=dimnames(df))
gender age income ate_string_cheese tech_familiarity
1 1 4 2 2 1
2 1 3 1 2 2
3 2 2 2 2 3
4 1 1 4 2 2
This is a matrix. You can convert it to a data frame.
We can convert the columns to factors and coerce them to integer:
df[] <- lapply(df, function(x) as.integer(factor(x)))
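One caveat with coercing whole labels to factor codes, sketched here under the assumption that the income column contains exactly the four strings from the question: two different spellings of the "B" band become two separate levels, so the codes no longer line up with the letter prefixes. The names income, naive, and by_letter are illustrative only.

```r
# Income labels from the question; band B is spelled two different ways
income <- c("B. $50K - $80K", "A. $35K - $49K",
            "B. 50k - 79,999", "D. $100k - $149k")

# Whole-label coding: every distinct string becomes its own level
naive <- as.integer(factor(income))
nlevels(factor(income))   # 4 levels, although only bands A, B and D occur

# Prefix coding: use the leading letter's position in the alphabet instead
by_letter <- match(substr(trimws(income), 1, 1), LETTERS)
by_letter                 # 2 1 2 4
```

Coding by prefix keeps both "B" rows at 2 and leaves a gap at 3 for the band that does not occur in this sample.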
I have three dataframes, df1, df2, and df3. They each have columns date, A, B, and C. I made sure that all the date columns in each of the dataframes were of the same class (Date), all the A columns in each of the dataframes were of the same class (factor), and the other two columns were integers. I used ungroup() to get rid of any grouping I had used in the dataframes.
So the dataframes look something like:
- Date | A | B | C
1 2020-01-01 | House | 1 | 2
2 2020-01-02 | House | 3 | 4
3 2020-01-03 | House | 5 | 6
- Date | A | B | C
1 2020-01-01 | Field | 1 | 2
2 2020-01-02 | Field | 3 | 4
3 2020-01-03 | Field | 5 | 6
- Date | A | B | C
1 2020-01-01 | Store | 1 | 2
2 2020-01-02 | Store | 3 | 4
3 2020-01-03 | Store | 5 | 6
Then I tried to use new_df <- bind_rows(df1, df2, df3) to append them into a single dataframe (new_df). No errors appear. I used new_df$A <- as.factor(new_df$A) to ensure that A was indeed a factor.
new_df looks something like:
- Date | A | B | C
1 2020-01-01 | House | 1 | 2
2 2020-01-02 | House | 3 | 4
3 2020-01-03 | House | 5 | 6
4 2020-01-01 | Field | 1 | 2
5 2020-01-02 | Field | 3 | 4
6 2020-01-03 | Field | 5 | 6
7 2020-01-01 | Store | 1 | 2
8 2020-01-02 | Store | 3 | 4
9 2020-01-03 | Store | 5 | 6
Then I made a ggplot with graph <- ggplot(data = new_df, aes(x = B, y = C, group = A)) + geom_line() + transition_reveal(date). No errors appear. I checked the plot data, and the data was of type double (S3: Date), factor, integer, and integer (as expected).
When I go to print the plot with animate(graph, renderer=gifski_renderer("graph.gif")) by pressing 'run current chunk', the graph prints nicely and I can open the gif file.
However, when I press Knit, with an output to a pdf_document, the following error appears:
Error: Unsupported device In addition: Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) : binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) : binding character and factor vector, coercing into character vector
4: In bind_rows_(x, .id) : binding character and factor vector, coercing into character vector Execution halted.
The error is traced back to the line where I used the animate() function.
I'm not sure what the issue is because I can't spot where the unequal factors are, but it is preventing me from getting my desired output.
Any thoughts?
In a survey, there was a question that asked: "What aspect of the course helped you learn concepts the most? Select all that apply."
Here is what the list of responses looked like:
Student_ID = c(1,2,3)
Responses = c("lectures,tutorials","tutorials,assignments,lectures", "assignments,presentations,tutorials")
Grades = c(1.1,1.2,1.3)
Data = data.frame(Student_ID,Responses,Grades);Data
Student_ID | Responses | Grades
1 | lectures,tutorials | 1.1
2 | tutorials,assignments,lectures | 1.2
3 | assignments,presentations,tutorials | 1.3
Now I want to create a data frame that looks something like this
Student_ID | Lectures | Tutorials | Assignments | Presentation | Grades
1 | 1 | 1 | 0 | 0 | 1.1
2 | 1 | 1 | 1 | 0 | 1.2
3 | 0 | 1 | 1 | 1 | 1.3
I managed to separate the comma separated responses into columns, using the splitstackshape package. So currently my data looks like this:
Student ID | Response 1 | Response 2 | Response 3 | Response 4 | Grades
1 | lectures | tutorials | NA | NA | 1.1
2 | tutorials | assignments | lectures | NA | 1.2
3 | assignments| presentation| tutorials | NA | 1.3
But as I stated earlier, I would like my table to look like the one I presented above, with dummy codes. I am stuck on how to proceed. Perhaps one idea is to go through each observation in the columns and append 1 or 0 to a new data frame with lectures, tutorials, assignments, and presentations as the headers?
First the Response column is converted from factor to character class. Each element of that column is then split on comma. I don't know what all the possible responses are, so I used all that are present. Next the split Response column is tabulated, specifying the possible levels. The resulting list is converted into a matrix before being mixed into the old data.frame.
Data$Responses <- as.character(Data$Responses)
resp.split <- strsplit(Data$Responses, ",")
lev <- unique(unlist(resp.split))
resp.dummy <- lapply(resp.split, function(x) table(factor(x, levels=lev)))
Data2 <- with(Data, data.frame(Student_ID, do.call(rbind, resp.dummy), Grades))
Data2
# Student_ID lectures tutorials assignments presentations Grades
# 1 1 1 1 0 0 1.1
# 2 2 1 1 1 0 1.2
# 3 3 0 1 1 1 1.3
I found a response to my question. I initially did
library(splitstackshape)
Responses = cSplit(Data, "Responses",",")
Then I added the following line:
library(qdapTools)
TA <- mtabulate(as.data.frame(t(TA)))
It worked for me.
N | [1] | [2] | [3]
1 | 3 | 20 | 3
2 | 2 | 10 | 3
3 | 3 | 25 | 3
4 | 1 | 15 | 3
5 | 3 | 30 | 3
Can you help me get the sum of the second column, but only for the rows that have 3 in the first column? For example, in that matrix it is 20+25+30=75. I need the fastest way possible (it's actually a big matrix).
P.S. I tried something like this: with(Train, sum(Column2[,"Date"] == i))
As you can see, I need the sum of Column2 where the date has a certain value (from 1 to 12).
We can create a logical index from the first column, use it to subset the second column, and take the sum:
sum(m1[m1[,1]==3,2])
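A self-contained sketch of the logical-index approach, using the five rows from the question. The matrix name m1 follows the answer; the grouped variant with rowsum() is an added suggestion for the case where a sum is needed for every key value (1 to 12 in the question), not something the original answer proposed.

```r
# The example matrix from the question, filled column by column
m1 <- matrix(c(3, 2, 3, 1, 3,       # column 1: the key
               20, 10, 25, 15, 30,  # column 2: the values to sum
               3, 3, 3, 3, 3),      # column 3: unused here
             ncol = 3)

# Logical index on column 1, then sum the matching entries of column 2
total <- sum(m1[m1[, 1] == 3, 2])
total    # 75

# One pass over the matrix that returns the sum for every key value
by_key <- rowsum(m1[, 2], group = m1[, 1])
by_key["3", 1]   # 75
```

Both calls are vectorized, so they avoid an explicit loop over rows of a large matrix.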
EDIT: Based on @Richard Scriven's comment.
I am using the DGET function in LibreOffice. I have the first table shown below (top), and I want to build the second table (bottom). I can use the DGET function where Database is the cell range containing the top table and Database Field is "Winner".
Is it possible to have different cell ranges in Search Criteria, so that each cell in the row for Case #1 can have a separate formula with a different search criterion, as given in the first row of the bottom table?
If I have to use separate contiguous cell ranges for the search criteria, then there would be [n*Chances] cell ranges, where n = total number of cases (~150 in my case) and Chances = possible number of Chance# values (50 in my case).
Case | Chance# | Winner
-------------------------
1 | 7 | Joe
1 | 9 | Emil
1 | 10 | Harry
1 | 11 | Kate
2 | 1 | Tom
2 | 3 | Jerry
2 | 4 | Mike
2 | 7 | John
Case |Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|
|="=1" |="=2" |="=3" |="=4" |="=5" |="=6" |="=7" |="=8" |="=9" |="=10" |="=11" | ---- |="=50"
1 | | | | | | | Joe | |Emil |Harry | Kate | ---- |
2 | Tom | |Jerry |Mike | | | John | | | | | ---- |
To do so, you need to change your approach. Instead of using DGET, I'm using a somewhat more complex method:
Considering your example:
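   | A            | B    | C       | D      | E
1  | #            | Case | Chance# | Winner |
2  | 1            | 1    | 7       | Joe    |
3  | 2            | 1    | 9       | Emil   |
4  | 3            | 1    | 10      | Harry  |
5  | 4            | 1    | 11      | Kate   |
6  | 5            | 2    | 1       | Tom    |
7  | 6            | 2    | 3       | Jerry  |
8  | 7            | 2    | 4       | Mike   |
9  | 8            | 2    | 7       | John   |
10 |              |      |         |        |
11 | Case\Chance# | 1    | 2       | 3      | 4
12 | 1            |      |         |        |
13 | 2            | Tom  |         | Jerry  | Mike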
I use the following:
=IF(SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9))> 0,INDEX($D$2:$D$9,SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9))),"")
Let's ignore the IF for now and focus on the real work:
First, get the row that matches your conditions. $B$2:$B$9=$A12 and $C$2:$C$9=B$11 each produce a TRUE/FALSE array. Multiplying them gives a 0/1 array with a single 1 at the matching row, and multiplying that by the ID column turns the 1 into the row number within your table.
SUMPRODUCT then collapses the result array into a single value (the row).
Finally, INDEX retrieves the desired value from that row.
The IF statement tests whether a match exists (SUMPRODUCT > 0), so that cells with no match are left empty.
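The index arithmetic above can be sketched outside the spreadsheet as well. The R snippet below is only an illustration of the same multiply-and-sum logic, not a LibreOffice formula; the data frame tbl and the helper lookup_winner are hypothetical names introduced for this example.

```r
# The top table from the example, with the ID column from column A
tbl <- data.frame(
  id     = 1:8,
  case   = c(1, 1, 1, 1, 2, 2, 2, 2),
  chance = c(7, 9, 10, 11, 1, 3, 4, 7),
  winner = c("Joe", "Emil", "Harry", "Kate", "Tom", "Jerry", "Mike", "John"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring IF(SUMPRODUCT(...) > 0, INDEX(...), "")
lookup_winner <- function(case, chance) {
  # 0/1 match arrays multiplied by the ID leave only the matching row number
  row <- sum((tbl$case == case) * (tbl$chance == chance) * tbl$id)
  if (row > 0) tbl$winner[row] else ""
}

lookup_winner(2, 3)   # "Jerry"
lookup_winner(1, 1)   # "" because no such pair exists

# Fill the whole Case x Chance# grid (cases 1-2, chances 1-11)
grid <- outer(1:2, 1:11, Vectorize(lookup_winner))
```

As in the spreadsheet version, this relies on each (Case, Chance#) pair occurring at most once; with duplicates the summed IDs would point at the wrong row.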
When considering time-dependent data in survival analysis, you have multiple start-stop times for an individual subject, with measurements of the covariates at each start-stop time. How does the coxph function keep track of which subject it is associating the start and stop times and the covariates with?
The function call looks as follows:
coxph(Surv(start, stop, event, type) ~ X)
Your data may look as follows
subject | start | stop | event | covariate |
--------+---------+--------+--------+-----------+
1 | 1 | 7 | 0 | 2 |
1 | 7 | 14 | 0 | 3 |
1 | 14 | 17 | 1 | 6 |
2 | 1 | 7 | 0 | 1 |
2 | 7 | 14 | 0 | 1 |
2 | 14 | 21 | 0 | 2 |
3 | 1 | 3 | 1 | 8 |
How can the function get away without an individual subject specifier?
My understanding is that survival analysis is not interested in following individuals through time. It looks at total counts at each time point, so the subject specifier is irrelevant: each (start, stop] row is treated as its own at-risk interval. Based on those counts, probabilities can be estimated that any particular subject will be alive or dead at a certain time given certain treatments.
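To make the counting idea concrete, here is a base-R sketch that builds the risk set at each event time directly from the example table, without ever consulting the subject column. It does not call coxph; it only illustrates the bookkeeping, under the assumption that a row is at risk at time t when start < t <= stop. The names d, event_times, and risk_sets are illustrative.

```r
# The example data from the question
d <- data.frame(
  subject   = c(1, 1, 1, 2, 2, 2, 3),
  start     = c(1, 7, 14, 1, 7, 14, 1),
  stop      = c(7, 14, 17, 7, 14, 21, 3),
  event     = c(0, 0, 1, 0, 0, 0, 1),
  covariate = c(2, 3, 6, 1, 1, 2, 8)
)

# Distinct times at which an event occurs
event_times <- sort(unique(d$stop[d$event == 1]))

# For each event time, the rows whose (start, stop] interval covers it
risk_sets <- lapply(event_times, function(t) which(d$start < t & d$stop >= t))
names(risk_sets) <- event_times

risk_sets[["3"]]    # rows 1, 4, 7
risk_sets[["17"]]   # rows 3, 6

# Covariate values that enter the comparison at time 17
d$covariate[risk_sets[["17"]]]   # 6 2
```

Because the intervals of one subject never overlap, each subject contributes at most one row to any risk set, which is why the rows can be treated independently here.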