I have a dataset called dolls.csv that I imported using
dolls <- read.csv("dolls.csv")
This is a snippet of the data
Name   Review  Year  Strong  Skinny  Weak  Fat  Normal
Bell   3.5     1990  1       1       0     0    0
Jan    7.2     1997  0       0       1     0    1
Tweet  7.6     1987  1       1       0     0    0
Sall   9.5     2005  0       0       0     1    0
I am trying to run some preliminary analysis of this data. Name is the name of the doll, Review is a rating from 1 to 10, Year is the year the doll was made, and all columns after that are binary: 1 if the doll has that characteristic, 0 if it doesn't.
I ran
summary(dolls)
and got the column names, means, minimums and maximums of the values.
I am trying to see whether there are correlations between the characteristics and Year or Review, for example whether certain dolls have really high ratings yet have unfavorable traits, but I'm not sure how to construct charts or which functions to use in this case. I was also considering ANOVA and tests for outliers and group means, but I'm not sure how to compare values like this (in Python I'd write an if-then statement, but I don't know how to do that in R).
This is for a personal study I wanted to conduct and improve my R skills.
Thank you!
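Not part of the original post, just a minimal sketch of how the correlations and charts could be explored in base R, assuming the data frame is called dolls and has exactly the columns shown in the snippet (adjust the names if yours differ):

# numeric columns only (drop Name); column names assumed from the snippet
num_cols <- dolls[, c("Review", "Year", "Strong", "Skinny", "Weak", "Fat", "Normal")]
# correlation of Review and Year with each binary trait
round(cor(num_cols), 2)
# mean review rating for dolls with vs. without a given trait
aggregate(Review ~ Strong, data = dolls, FUN = mean)
# quick chart: distribution of ratings by trait
boxplot(Review ~ Strong, data = dolls,
        xlab = "Strong (0 = no, 1 = yes)", ylab = "Review rating")
# the R equivalent of a row-wise if-then is ifelse()
dolls$high_but_weak <- ifelse(dolls$Review >= 8 & dolls$Weak == 1, 1, 0)

Note that cor() between a 0/1 column and a numeric column gives a point-biserial correlation, which is a reasonable first look before moving on to ANOVA or t-tests.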
(I'm SUPER new to coding in general so all suggestions are much appreciated.)
So I'm working with a data set that contains panel survey data posed to the same 8000 participants 7 times over the course of the last decade. I currently have dummy-variable versions of the answers I'm interested in, so my data now looks like this:
colour2011  colour2016  colour2018
1           1           0
0           0           0
0           1           1
1           0           0
1           1           1
and the other variable's data looks similar, with the column names tied to the year the question was asked. Is there a way to not only show the change in answers for both using ggplot2, but also track the rate of change and display it visually by year?
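Not the original poster's solution, only a sketch of one common approach: reshape the wide year columns into long format and plot the share of 1-answers per year. The tiny data frame below is made up to mirror the table above, and the colour2011-style names are taken from it:

library(tidyr)
library(dplyr)
library(ggplot2)

# toy data mirroring the table above (one dummy column per survey year)
df <- data.frame(colour2011 = c(1, 0, 0, 1, 1),
                 colour2016 = c(1, 0, 1, 0, 1),
                 colour2018 = c(0, 0, 1, 0, 1))

# wide -> long: one row per respondent-year
long <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "year", names_prefix = "colour",
               values_to = "answer") %>%
  mutate(year = as.integer(year))

# rate of 1-answers in each survey year
rates <- long %>%
  group_by(year) %>%
  summarise(prop_yes = mean(answer), .groups = "drop")

# line chart of how that rate changes across the survey years
ggplot(rates, aes(x = year, y = prop_yes)) +
  geom_line() +
  geom_point() +
  labs(x = "Survey year", y = "Proportion answering 1")

The same long format also makes it easy to show individual changes, e.g. adding geom_line(aes(group = id), alpha = 0.2) on the raw long data.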
I am currently using k-means to cluster my data; however, I want each cluster label to appear only once in each given year. I have searched for answers for a whole night with no result. Does anyone have ideas on this problem in R, or is there a package I should look into? Thanks.
More background info:
I am trying to replicate the clustering of relationships, using the reported gender, education level and birth year. I am doing this because this is survey data whose respondents are elderly people, and they sometimes report inaccurate age or education information. My main challenge is that I want to have only one of each cluster label in each survey year; for example, I do not want to see two cluster 3s in survey year 2000. My data looks like this:
survey year  relationship          gender  education level  birth year  k-means cluster
2000         41 (first daughter)   0       3                1997        1
2003         41 (first daughter)   0       3                1997        1
2000         42 (second daughter)  0       4                1999        2
2003         42 (second daughter)  0       4                1999        2
2000         42 (third daughter)   0       5                1999        2
2003         42 (third daughter)   0       5                2001        3
Thanks in advance.
--Update--
A more detailed description of the task:
The data set is panel survey data asking elders about their health status and their relationships (incl. sons, daughters, neighbors). Since these older people are sometimes imprecise about their family members' demographic information such as birth year, education level, etc., we might need to delete a large part of the data when it does not match across waves.
(e.g., A reported that his first son was 30 years old in 1997, but said his first son was 29 years old in 1999; this record would therefore be problematic). My task is to save as much data as possible when the imprecision is not too high.
Therefore I first mutated columns to check the precision of each family member's record (e.g., birth year error %in% c(-1, 2)). Next, I run k-means on the family members detected as imprecise. In this way I save much of the data. Although I did not solve the problem above, it occurs so rarely that I can almost ignore or drop those observations.
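Not a ready-made answer to the one-label-per-year problem, but one possible approach is to run ordinary k-means first and then, within each survey year, assign rows to distinct cluster centres by solving an assignment problem (Hungarian algorithm, clue::solve_LSAP). The column names and toy values below only mirror the example table and are assumptions:

library(clue)  # solve_LSAP(): Hungarian algorithm for assignment problems

# toy data mirroring the example table above
dat <- data.frame(
  year      = c(2000, 2003, 2000, 2003, 2000, 2003),
  education = c(3, 3, 4, 4, 5, 5),
  birthyear = c(1997, 1997, 1999, 1999, 1999, 2001)
)
feats <- scale(dat[, c("education", "birthyear")])
k     <- 3

# step 1: ordinary k-means to get the cluster centres
set.seed(1)
km <- kmeans(feats, centers = k, nstart = 25)

# step 2: within each survey year, give every row a distinct label,
# minimising the total distance to the centres
dat$cluster <- NA_integer_
for (yr in unique(dat$year)) {
  idx  <- which(dat$year == yr)
  d    <- as.matrix(dist(rbind(feats[idx, , drop = FALSE], km$centers)))
  cost <- d[seq_along(idx), length(idx) + seq_len(k), drop = FALSE]
  dat$cluster[idx] <- as.integer(solve_LSAP(cost))  # needs length(idx) <= k
}
dat

This guarantees that no cluster label repeats within a year, at the price of sometimes moving a row away from its nearest centre.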
Hi everyone! I hope you are having a great day.
Aim and context
My two dataframes are built from different methods, but measure the same parameters on the same signals.
I’d like to match every signal in the first dataframe with the same signal in the second dataframe, to compare the parameter values and evaluate the methods against each other.
I would gratefully appreciate any help, as I’ve reached my beginner’s limits both in R coding and in dataframe management.
Basically, I would like to find matches across two separate dataframes and treat matched rows as referring to the same entity (for instance by creating an ID variable), in order to perform statistical analysis for paired data.
I could have made the matches by hand in a spreadsheet, but because there are hundreds of entries and more comparisons to come, I’d like to automate the matching and the creation of the dataframe.
To give you an idea, my dataframes look like this:
DF1
Recording  Selection  Start (ms)  Freq.max (kHz)
001        1          11.3        42.4
001        2          122.9       46.2
001        3          232.3       47.5
002        1          22.9        30.9
002        2          512.4       31.3
My second dataframe would look something like this:
DF2
Recording  Selection  Start (ms)  Freq.max (kHz)
001        1          10.9        41.8
001        2          122.1       44.5
001        3          231.3       44.4
002        1          513.0       30.2
My ideas
I thought about identifying each signal, but
An ID using "Recording + selection" (001_1, 001_2...) would not work because some signals are not detected in both methods.
So I'd want to use the start position to identify the signals, but rounding to the closest or upper/lower value would not match all the signals.
Hmisc::find.matches() function
I tried the function find.matches() from the package Hmisc, which returns the matches between two columns given a tolerance threshold you supply.
library(Hmisc)
find <- find.matches(DF_method1$start_one, DF_method2$start_two, tol = 2)
(I arbitrarily chose a tolerance of 2 ms for two detections to be considered the same signal.)
The output looks like this:
Matches:
Match #1 Match #2 Match #3
[1,] 1 7 0
[2,] 2 42 0
[3,] 3 0 0
[4,] 4 0 0
[5,] 0 0 0
[6,] 5 0 0
[7,] 22 6 0
I feel like it is coming together, but I am stuck on these two questions:
How do I find the closest match within each recording, rather than comparing all signals across all recordings? (In the example here, all first matches are correctly identified except no. 7, which is matched with no. 22 from a different recording.) How could I run the function within each recording?
How do I create a dataframe from the output? That is, a dataframe containing only the signals that had a match, with their related parameter values.
I feel like this function gets close to my aim, but if you have any other suggestions, I am all ears.
Thanks a lot
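For what it's worth, here is a base-R sketch that addresses both questions at once: join the two tables by recording (so matches can only happen within the same recording), keep the pairs whose start times differ by at most the tolerance, and keep the closest DF2 signal for each DF1 signal. The column names (recording, selection, start, freq) are simplified assumptions, not the real ones:

# toy versions of the two dataframes above, with simplified column names
df1 <- data.frame(recording = c("001", "001", "001", "002", "002"),
                  selection = c(1, 2, 3, 1, 2),
                  start     = c(11.3, 122.9, 232.3, 22.9, 512.4),
                  freq      = c(42.4, 46.2, 47.5, 30.9, 31.3))
df2 <- data.frame(recording = c("001", "001", "001", "002"),
                  selection = c(1, 2, 3, 1),
                  start     = c(10.9, 122.1, 231.3, 513.0),
                  freq      = c(41.8, 44.5, 44.4, 30.2))

tol <- 2  # max start-time difference (ms) to call it the same signal

# candidate pairs: only signals from the same recording are combined
pairs <- merge(df1, df2, by = "recording", suffixes = c("_m1", "_m2"))
pairs$dt <- abs(pairs$start_m1 - pairs$start_m2)
pairs <- pairs[pairs$dt <= tol, ]

# keep only the closest DF2 signal for each DF1 signal
pairs   <- pairs[order(pairs$dt), ]
matched <- pairs[!duplicated(pairs[, c("recording", "selection_m1")]), ]

# one ID per matched signal, ready for paired statistics
matched$signal_id <- paste(matched$recording, matched$selection_m1, sep = "_")
matched[order(matched$recording, matched$selection_m1), ]

Signals with no partner within the tolerance (like DF1 recording 002, selection 1 here) simply drop out, which is what you want for a paired comparison.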
I have criminal sentencing data that contains a text variable with phrases like "2 months jail", "14 months prison", or "12 months community supervision". I would like to run a logistic regression to determine the odds that a particular defendant was sent to prison or jail versus released to community supervision. So I want to create a binary variable that is 1 for someone sent to "jail"/"prison" and 0 for those sent to another program.
I have tried using library(qdap) but have not had any luck. I have also tried ifelse(df$text %in% "jail", "1", "0") but it only shows 1 observation when I know there are several thousand.
Small data sample:
data<-data.frame('caseid'=c(1,2,3),'text'=c("went to prison","went to jail","released"))
caseid text
1 1 went to prison
2 2 went to jail
3 3 released
Trying to create a binary variable - sentenced - to analyze logistically like:
caseid text sentenced
1 1 went to prison 1
2 2 went to jail 1
3 3 released 0
Thank you for any help you can offer!
You can do the following in base R
transform(data, sentenced = +grepl("(jail|prison)", text))
# caseid text sentenced
#1 1 went to prison 1
#2 2 went to jail 1
#3 3 released 0
Explanation: the regular expression "(jail|prison)" matches "jail" or "prison", and the unary operator + turns the logical output of grepl() into an integer (TRUE becomes 1, FALSE becomes 0).
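As a possible next step (not part of the answer above), the new binary column can feed straight into glm() once real predictors are added; the age column here is purely hypothetical:

data <- transform(data, sentenced = +grepl("(jail|prison)", text))
# with real predictors in place (the 'age' column is hypothetical):
# fit <- glm(sentenced ~ age, data = data, family = binomial)
# summary(fit)     # coefficients are log-odds of being sent to jail/prison
# exp(coef(fit))   # odds ratios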
Can anyone tell me how to constrain the output and selected variables of a neural network so that the influence of a characteristic is positive, using the function nnet in R? I have a database (real estate) with numerical values (surface, price) and categorical values (parking Y/N, areacode, etc.). The output of the model is the price. The problem is that the model currently estimates that, in a few areacodes, homes with a parking spot are worth less than homes without one. I would like to constrain the output (price) so that in each areacode the influence of a parking spot on the price is positive. Of course, a really small house with a parking spot can still be cheaper than a big house without one.
Example data (out of 80,000 observations):
Price   Surface  Parking Y  Areacode 1  Areacode 2  Areacode 3
100000  100      0          1           0           0
110000  99       1          0           1           0
200000  110      0          0           0           1
150000  130      0          0           1           0
190000  130      1          0           0           1
(thanks for putting the table in a decent format)
I modelled this in R using nnet:
library(nnet)
model <- nnet(Price ~ ., data = data6, MaxNWts = 2500, size = 12,
              skip = TRUE, linout = TRUE, decay = 0.025, na.action = na.omit)
I used nnet because I hope to find different values for parking spots per areacode. If there is a better way to do this, please let us know.
I'm using RStudio version 0.98.976 on Windows XP (yes, I know ;)).
Thanks in advance for your replies
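nnet has no built-in way to impose that kind of monotonicity constraint, but as a rough diagnostic one can at least measure the parking effect the fitted net implies in each areacode by predicting with the parking dummy toggled. Column names like Parking.Y and Areacode.1 are assumptions here; use whatever they are called in data6:

# predict each observation with the parking dummy forced to 1 and to 0
with_parking    <- transform(data6, Parking.Y = 1)
without_parking <- transform(data6, Parking.Y = 0)
implied_effect  <- as.numeric(predict(model, with_parking) -
                              predict(model, without_parking))

# average implied parking premium per areacode
area <- with(data6, ifelse(Areacode.1 == 1, "area1",
                    ifelse(Areacode.2 == 1, "area2", "area3")))
tapply(implied_effect, area, mean)

If a hard positive constraint is really required, a model family with built-in monotonicity constraints (for example gradient-boosted trees that support monotone constraints) may be easier to work with than nnet.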