I have the following data frame in R.
ID    | Year_Month | Amount
10001 | 2021-06    | 85
10001 | 2021-07    | 32.0
20032 | 2021-08    | 63
20032 | 2021-09    | 44.23
20033 | 2021-11    | 10.90
I would like to transform this data to look something like this:
ID    | 2021-06 | 2021-07 | 2021-08 | 2021-09 | 2021-11
10001 | 85      | 32      | 0       | 0       | 0
20032 | 0       | 0       | 63      | 44.23   | 0
20033 | 0       | 0       | 0       | 0       | 10.90
The Amount values should be spread across columns based on the Year_Month column. Can someone help? I tried using transpose, but it did not work.
You should check out the tidyverse package; it has some really good functions for data wrangling.
## Loading the required libraries (tidyverse attaches dplyr and tidyr)
library(tidyverse)
## Creating the dataframe (using the question's column name Year_Month)
df <- data.frame(ID = c(10001, 10001, 20032, 20032, 20033),
                 Year_Month = c('2021-06', '2021-07', '2021-08', '2021-09', '2021-11'),
                 Amount = c(85, 32, 63, 44.23, 10.9))
## Pivot from long to wide
df_pivot <- df %>%
  pivot_wider(names_from = Year_Month, values_from = Amount)
## Replacing NA with 0
df_pivot[is.na(df_pivot)] <- 0
df_pivot
# A tibble: 3 x 6
ID `2021-06` `2021-07` `2021-08` `2021-09` `2021-11`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10001 85 32 0 0 0
2 20032 0 0 63 44.2 0
3 20033 0 0 0 0 10.9
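As an aside, newer versions of tidyr (1.1.0 or later, if I recall correctly) let you fill the missing cells in the same call through the values_fill argument, which makes the separate NA-replacement step unnecessary:
## Same result in one call (assumes tidyr >= 1.1.0)
df_pivot <- df %>%
  pivot_wider(names_from = Year_Month, values_from = Amount, values_fill = 0)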
I have a dataset and I need to filter out bad/invalid data before my computation. I tried percentiles, but that also filters out genuine values.
E.g.: 1, 2, 8, 10, 20, 25, 55, 100, 100000, 98, 99, 95
Here, 100000 is corrupted/bad. When I call max(), I expect 100 instead of 100000.
You can use series_outliers(), which assigns an anomaly score to each point, and then filter on that score:
datatable(val:int)[1, 2, 8, 10, 20, 25, 55, 100, 100000, 98, 99, 95]
| summarize val = make_list(val)
| extend anomaly_score = series_outliers(val)
| mv-expand val to typeof(int), anomaly_score to typeof(real);
//| where anomaly_score between (-1.5 .. 1.5)
val    | anomaly_score
1      | -0.045592692382452886
2      | -0.017097259643419842
8      | 0
10     | 0
20     | 0
25     | 0
55     | 0
100    | 0.0028495432739035469
100000 | 2846.6965801726751
98     | 0
99     | 0
95     | 0
Uncommenting the where clause keeps only the points whose anomaly score falls between -1.5 and 1.5, which drops the 100000 row (score ≈ 2846.7); max() over the remaining values then returns 100 as expected.
I am wondering whether the stem function in R is producing the stem-and-leaf plot correctly for this example. The code
X <- c(rep(1,1000),2:15)
stem(X,width = 20)
produces the output
The decimal point is at the |
1 | 00000000+980
2 | 0
3 | 0
4 | 0
5 | 0
6 | 0
7 | 0
8 | 0
9 | 0
10 | 0
11 | 0
12 | 0
13 | 0
14 | 0
15 | 0
There are 1000 ones in the data (sum(X == 1) confirms this), but the output of the stem function seems to indicate that there are only 988: counting the eight zeros in the first row and adding 980 gives 988. Instead of +980, I think it should display +992 at the end of the first row.
Is there an error in the stem function or am I not reading the output correctly?
I have one dataset which includes the points of all students, along with other variables.
I also have an adjacency matrix which records which student is a peer of which other student.
Now I would like to use this second matrix (the network) to calculate the mean peer points for each student. Everyone can have a different number of peers.
To calculate the mean, I converted the simple 0/1 matrix into proportions, where the denominator for each row is the number of peers that student has.
The second matrix then would look something like this:
ID1 ID2 ID3 ID4 ID5
ID1 0 0 0 0 1
ID2 0 0 0.5 0.5 0
ID3 0 0.5 0 0 0.5
ID4 0 0.5 0 0 0.5
ID5 0.33 0 0.33 0.33 0
The points of each student are a simple variable in another dataset, and I would like to have the peer-average points as a second variable:
ID Points Peers
ID1 45 11
ID2 42 33.5
ID3 25 26.5
ID4 60 26.5
ID5 11 43.33
Are there any commands in Stata for this problem? I am currently looking into the Stata package nwcommands, but I am unsure whether it can help. I could use solutions for either Stata or R.
Without getting too creative, you can accomplish what you are trying to do with reshape, collapse and a couple of merges in Stata. Generally speaking, data in long format is easier to work with for this type of exercise.
Below is an example which produces the desired result.
/* Set-up data for example */
clear
input int(id points)
1 45
2 42
3 25
4 60
5 11
end
tempfile points
save `points'
clear
input int(StudentId id1 id2 id3 id4 id5)
1 0 0 0 0 1
2 0 0 1 1 0
3 0 1 0 0 1
4 0 1 0 0 1
5 1 0 1 1 0
end
/* End data set-up */
* Reshape peers data to long form
reshape long id, i(StudentId) j(PeerId)
drop if id == 0 // drop pairs where the student is not a peer of StudentId
* create id variable to use in merge
replace id = PeerId
* Merge to points data to get peer points
merge m:1 id using `points', nogen
* collapse data to the student level, sum peer points
collapse (sum) PeerPoints = points (count) CountPeers = PeerId, by(StudentId)
* merge back to points data to get student points
rename StudentId id
merge 1:1 id using `points', nogen
gen peers = PeerPoints / CountPeers
li id points peers
     +------------------------+
     | id   points      peers |
     |------------------------|
  1. |  1       45         11 |
  2. |  2       42       42.5 |
  3. |  3       25       26.5 |
  4. |  4       60       26.5 |
  5. |  5       11   43.33333 |
     +------------------------+
In the above code, I reshape your peer data into long form and keep only student-peer pairs. I then merge this data to the points data to get the points of each student's peers. From there, I collapse the data back to the student level, totaling peer points and counting peers in the process. At this point, you have the total points of each student's peers and the number of peers each student has. Now you simply merge back to the points data to get the subject student's points and divide total peer points (PeerPoints) by the number of peers (CountPeers) to get average peer points.
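Since the question also asks about R: the same reshape-merge-collapse logic translates fairly directly to dplyr/tidyr. A rough sketch, with the peer matrix and points entered by hand to match the example above:
## R sketch of the same long-format logic (data entered to match the example)
library(dplyr)
library(tidyr)
peers  <- data.frame(StudentId = 1:5,
                     id1 = c(0,0,0,0,1), id2 = c(0,0,1,1,0), id3 = c(0,1,0,0,1),
                     id4 = c(0,1,0,0,1), id5 = c(1,0,1,1,0))
points <- data.frame(id = 1:5, points = c(45, 42, 25, 60, 11))
peers %>%
  pivot_longer(-StudentId, names_to = "peer", values_to = "link") %>%  # reshape long
  filter(link == 1) %>%                                                # keep actual peers
  mutate(id = as.integer(sub("id", "", peer))) %>%                     # id to merge on
  left_join(points, by = "id") %>%                                     # merge peer points
  group_by(StudentId) %>%
  summarise(peers = mean(points))                                      # average per student
# peer averages: 11, 42.5, 26.5, 26.5, 43.33 (matching the corrected values below)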
nwcommands is an outstanding package I have never used or studied, so I will just try the problem from first principles. This is all matrix algebra, but given a matrix and a variable, I would approach it like this in Stata.
clear
scalar third = 1/3
mat M = (0,0,0,0,1\0,0,0.5,0.5,0\0,0.5,0,0,0.5\0,0.5,0,0,0.5\third,0,third,third,0)
input ID Points Peers
1 45 11
2 42 33.5
3 25 26.5
4 60 26.5
5 11 43.33
end
gen Wanted = 0
quietly forval i = 1/5 {
    forval j = 1/5 {
        replace Wanted = Wanted + M[`i', `j'] * Points[`j'] in `i'
    }
}
list
+--------------------------------+
| ID Points Peers Wanted |
|--------------------------------|
1. | 1 45 11 11 |
2. | 2 42 33.5 42.5 |
3. | 3 25 26.5 26.5 |
4. | 4 60 26.5 26.5 |
5. | 5 11 43.33 43.33334 |
+--------------------------------+
Small points: Using 0.33 for 1/3 doesn't give enough precision. You'll have similar problems for 1/6 and 1/7, for example.
Also, I get that the peers of 2 are 3 and 4 so their average is (25 + 60)/2 = 42.5, not 33.5.
EDIT: A similar approach starts with a data structure very like that imagined by @ander2ed in the other answer:
clear
input int(id points id1 id2 id3 id4 id5)
1 45 0 0 0 0 1
2 42 0 0 1 1 0
3 25 0 1 0 0 1
4 60 0 1 0 0 1
5 11 1 0 1 1 0
end
gen wanted = 0
quietly forval i = 1/5 {
    forval j = 1/5 {
        replace wanted = wanted + id`j'[`i'] * points[`j'] in `i'
    }
}
egen count = rowtotal(id1-id5)
replace wanted = wanted/count
list
+--------------------------------------------------------------+
| id points id1 id2 id3 id4 id5 wanted count |
|--------------------------------------------------------------|
1. | 1 45 0 0 0 0 1 11 1 |
2. | 2 42 0 0 1 1 0 42.5 2 |
3. | 3 25 0 1 0 0 1 26.5 2 |
4. | 4 60 0 1 0 0 1 26.5 2 |
5. | 5 11 1 0 1 1 0 43.33333 3 |
+--------------------------------------------------------------+
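For completeness on the R side: this really is just matrix algebra, so in R the whole computation is a single matrix product. A minimal sketch using the row-normalized weight matrix from the question (with exact thirds rather than 0.33, for the precision reason noted above):
## R: peer averages as the weight matrix times the points vector
M <- matrix(c(0,   0,   0,   0,   1,
              0,   0,   0.5, 0.5, 0,
              0,   0.5, 0,   0,   0.5,
              0,   0.5, 0,   0,   0.5,
              1/3, 0,   1/3, 1/3, 0),
            nrow = 5, byrow = TRUE)
points <- c(45, 42, 25, 60, 11)
as.vector(M %*% points)
# [1] 11.00000 42.50000 26.50000 26.50000 43.33333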
I read a similar post related to this problem, but I am afraid this error is due to something else. I have a CSV file with 8 observations and 10 variables:
> str(rorIn)
'data.frame': 8 obs. of 10 variables:
$ Acuity : Factor w/ 3 levels "Elective ","Emergency ",..: 1 1 2 2 1 2 2 3
$ AgeInYears : int 49 56 77 65 51 79 67 63
$ IsPriority : int 0 0 1 0 0 1 0 1
$ AuthorizationStatus: Factor w/ 1 level "APPROVED ": 1 1 1 1 1 1 1 1
$ iscasemanagement : Factor w/ 2 levels "N","Y": 1 1 2 1 1 2 2 2
$ iseligible : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1
$ referralservicecode: Factor w/ 4 levels "12345","278",..: 4 1 3 1 1 2 3 1
$ IsHighlight : Factor w/ 1 level "N": 1 1 1 1 1 1 1 1
$ RealLengthOfStay : int 25 1 1 1 2 2 1 3
$ Readmit : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1
I invoke the algorithm like this:
library("C50")
rorIn <- read.csv(file = "RoRdataInputData_v1.6.csv", header = TRUE, quote = "\"")
rorIn$Readmit <- factor(rorIn$Readmit)
fit <- C5.0(Readmit~., data= rorIn)
Then I get:
> source("~/R-workspace/src/RoR/RoR/testing.R")
c50 code called exit with value 1
>
I am following other recommendations such as:
- Using a factor as the decision variable
- Avoiding empty data
Any help on this? I read that this is one of the best algorithms for machine learning, but I get this error all the time.
Here is the original dataset:
Acuity,AgeInYears,IsPriority,AuthorizationStatus,iscasemanagement,iseligible,referralservicecode,IsHighlight,RealLengthOfStay,Readmit
Elective ,49,0,APPROVED ,N,Y,SNF ,N,25,1
Elective ,56,0,APPROVED ,N,Y,12345,N,1,0
Emergency ,77,1,APPROVED ,Y,Y,OBSERVE ,N,1,1
Emergency ,65,0,APPROVED ,N,Y,12345,N,1,0
Elective ,51,0,APPROVED ,N,Y,12345,N,2,1
Emergency ,79,1,APPROVED ,Y,Y,278,N,2,0
Emergency ,67,0,APPROVED ,Y,Y,OBSERVE ,N,1,1
Urgent ,63,1,APPROVED ,Y,Y,12345,N,3,0
Thanks in advance for any help,
David
You need to clean your data in a few ways:
- Remove the unnecessary columns with only one level. They contain no information and lead to problems.
- Convert the class of the target variable rorIn$Readmit into a factor.
- Separate the target variable from the data set that you supply for training.
This should work:
rorIn <- read.csv("RoRdataInputData_v1.6.csv", header=TRUE)
rorIn$Readmit <- as.factor(rorIn$Readmit)
library(Hmisc)
singleLevelVars <- names(rorIn)[contents(rorIn)$contents$Levels == 1]
trainvars <- setdiff(colnames(rorIn), c("Readmit", singleLevelVars))
library(C50)
RoRmodel <- C5.0(rorIn[, trainvars], rorIn$Readmit, trials = 10)
predict(RoRmodel, rorIn[, trainvars])
#[1] 1 0 1 0 0 0 1 0
#Levels: 0 1
You can then evaluate accuracy, recall, and other statistics by comparing this predicted result with the actual value of the target variable:
rorIn$Readmit
#[1] 1 0 1 0 1 0 1 0
#Levels: 0 1
The usual way is to set up a confusion matrix to compare actual and predicted values in binary classification problems. In the case of this small data set one can easily see that there is only one false negative result. So the code seems to work pretty well, but this encouraging result can be deceptive due to the very small number of observations.
library(gmodels)
actual <- rorIn$Readmit
predicted <- predict(RoRmodel,rorIn[,trainvars])
CrossTable(actual, predicted, prop.chisq = FALSE, prop.r = FALSE)
# Total Observations in Table: 8
#
#
# | predicted
# actual | 0 | 1 | Row Total |
#--------------|-----------|-----------|-----------|
# 0 | 4 | 0 | 4 |
# | 0.800 | 0.000 | |
# | 0.500 | 0.000 | |
#--------------|-----------|-----------|-----------|
# 1 | 1 | 3 | 4 |
# | 0.200 | 1.000 | |
# | 0.125 | 0.375 | |
#--------------|-----------|-----------|-----------|
# Column Total | 5 | 3 | 8 |
# | 0.625 | 0.375 | |
#--------------|-----------|-----------|-----------|
On a larger data set it would be useful, if not necessary, to separate the set into training data and test data. There is a lot of good literature on machine learning that will help you in fine-tuning the model and its predictions.
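For illustration, here is a minimal sketch of such a split, reusing rorIn and trainvars from above; with only 8 observations this is purely schematic rather than a meaningful evaluation:
## Schematic train/test split (only sensible on larger data)
set.seed(42)                                       # reproducible split
idx   <- sample(nrow(rorIn), floor(0.7 * nrow(rorIn)))
train <- rorIn[idx, ]
test  <- rorIn[-idx, ]
fit2  <- C5.0(train[, trainvars], train$Readmit, trials = 10)
mean(predict(fit2, test[, trainvars]) == test$Readmit)  # holdout accuracy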
I have the following data frame in R that has overlapping data in the two columns a_sno and b_sno
a_sno <- c(4,5,5,6,6,7,9,9,10,10,10,11,13,13,13,14,14,15,21,21,21,22,23,23,24,25,183,184,185,185,200)
b_sno <- c(5,4,6,5,7,6,10,13,9,13,14,15,9,10,14,10,13,11,22,23,24,21,21,25,21,23,185,185,183,184,200)
df <- data.frame(a_sno, b_sno)
If you take a close look at the data you can see that 4, 5, 6 & 7 intersect/overlap, and I need to put them into a group called 1.
Likewise, 9, 10, 13 & 14 go into group 2, and 11 and 15 into group 3, and so on; 200 does not intersect with any other row but still needs to be assigned its own group.
The resulting output should look like this:
---------
group|sno
---------
1 | 4
1 | 5
1 | 6
1 | 7
2 | 9
2 | 10
2 | 13
2 | 14
3 | 11
3 | 15
4 | 21
4 | 22
4 | 23
4 | 24
4 | 25
5 | 183
5 | 184
5 | 185
6 | 200
Any help to get this done is much appreciated. Thanks
Probably not the most efficient solution but you could use graphs to do this:
# sort each row and drop duplicate pairs (so 4-5 and 5-4 count once)
df <- as.data.frame(unique(t(apply(df, 1, sort))))
# load the library
library(igraph)
# build a graph from the edge list
graph <- graph_from_data_frame(df)
# decompose it into connected components
components <- decompose(graph)
# get the vertices of each subgraph, tagged with its component number
result <- lapply(seq_along(components), function(i) {
  vertex <- as.numeric(V(components[[i]])$name)
  cbind(rep(i, length(vertex)), vertex)
})
# make the final data frame
output <- as.data.frame(do.call(rbind, result))
colnames(output) <- c("group", "sno")
output
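A slightly shorter variant of the same idea: igraph's components() returns the component membership of every vertex directly, which saves the decompose()/lapply() step. A sketch, assuming the de-duplicated df from above (the group numbers may come out in a different order, but the grouping itself is the same):
## Alternative: read the groups straight from components()
g <- graph_from_data_frame(df)
memb <- components(g)$membership
output2 <- data.frame(group = memb, sno = as.numeric(names(memb)))
output2[order(output2$group, output2$sno), ]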