I have a data frame as follows:
Name     Condition  NumMessage
Table 1  NULL       80
Table 1  Fair       20
Table 1  Good       60
Table 1  Ideal      50
Table 1  Great      80
Table 2  NULL       80
Table 2  Fair       100
Table 2  Good       90
Table 2  Ideal      50
Table 2  Great      40
and so on. I tried to create a frequency table for the number of messages for each table.
data = as.data.frame(prop.table(table(dataframe$Name)))
colnames(data) = c('Table Name', 'Frequency')
data
but this returns the same frequency for all tables. For example, Table 1 contains a total of 290 messages while Table 2 contains 360 messages, yet the code above gives the same frequency for both.
Also, when I tried to get the frequency of each condition for each table, I got the same numbers across tables.
prop.table(table(dataframe$Condition, dataframe$Name))
NULL | some value
Fair | some value
Good | some value
Ideal | some value
Great | some value
Is this the correct way to get the frequency of the total number of messages for each table, and the frequency of conditions within each table?
xtabs is the base R way to get a summed contingency table. (Your table(dataframe$Name) call counts rows per Name rather than summing NumMessage; each table has five rows, which is why the frequencies come out identical.)
prop.table(xtabs(NumMessage ~ ., data=df), 1)
# Condition
#Name Fair Good Great Ideal NULL
# Table1 0.06896552 0.20689655 0.27586207 0.17241379 0.27586207
# Table2 0.27777778 0.25000000 0.11111111 0.13888889 0.22222222
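If you also want the share of all messages that falls in each table (the first half of the question), the same idea works with a one-sided sum; a minimal sketch:
prop.table(xtabs(NumMessage ~ Name, data=df))
# Name
#    Table1    Table2
# 0.4461538 0.5538462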
We could try with acast
library(reshape2)
prop.table(acast(df, Name~Condition, value.var='NumMessage', sum),1)
# Fair Good Great Ideal NULL
#Table 1 0.06896552 0.2068966 0.2758621 0.1724138 0.2758621
#Table 2 0.27777778 0.2500000 0.1111111 0.1388889 0.2222222
If we call your dataset df, then perhaps this is what you are looking for?
df1 = subset(df, Name=='Table1')
df2 = subset(df, Name=='Table2')
prop.table(df1[,3])
prop.table(df2[,3])
aggregate(df1$NumMessage, list(df1$Name), sum)
aggregate(df2$NumMessage, list(df2$Name), sum)
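A more compact variant that avoids splitting into df1/df2 is the formula interface to aggregate; a sketch along the same lines:
# sum NumMessage per table in one call, then turn the sums into shares
totals <- aggregate(NumMessage ~ Name, data=df, sum)
totals$Freq <- prop.table(totals$NumMessage)
totals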
You can always tackle this with the package sqldf.
library(sqldf)
Name<-c('Table1','Table1','Table1','Table1','Table1','Table2','Table2','Table2','Table2','Table2')
Cond<-c(NA,'Fair','Good','Ideal','Great',NA,'Fair','Good','Ideal','Great')
Msg<-c(80,20,60,50,80,80,100,90,50,40)
df<-data.frame(Name,Cond,Msg)
Your dataframe:
Name Cond Msg
1 Table1 <NA> 80
2 Table1 Fair 20
3 Table1 Good 60
4 Table1 Ideal 50
5 Table1 Great 80
6 Table2 <NA> 80
7 Table2 Fair 100
8 Table2 Good 90
9 Table2 Ideal 50
10 Table2 Great 40
Now simply use this statement to get the sum of messages for each table:
sqldf("select Name, sum(Msg) from df group by Name ")
Name sum(Msg)
1 Table1 290
2 Table2 360
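If you want frequencies (shares of the grand total) rather than raw sums, a sketch of the same query; the * 1.0 forces floating-point division in SQLite:
sqldf("select Name, sum(Msg) * 1.0 / (select sum(Msg) from df) as Freq
       from df group by Name")
    Name      Freq
1 Table1 0.4461538
2 Table2 0.5538462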
If you want the sum of messages for each condition, then use:
sqldf("select Cond, sum(Msg) from df group by Cond ")
Cond sum(Msg)
1 <NA> 160
2 Fair 120
3 Good 150
4 Great 120
5 Ideal 100
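And for the sums per table and condition together, grouping by both columns is a natural extension:
sqldf("select Name, Cond, sum(Msg) as Msg from df group by Name, Cond")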
Hope that helps.
I have a sqlite3 table containing students' marks for an assignment. Below is sample data from the table:
Id  Name   Marks
1   Mark   87
2   John   50
3   Adam   65
4   Cindy  68
5   Ruth   87
I would like to create a new column 'Rank', giving the students a rank according to marks scored.
There are 2 main criteria to follow:
If both students have the same marks, their rank would be the same
The total rank count would be the same as the total number of students. For example, if there are two students with Rank 1, the next student below them would be Rank 3.
Below is a sample of the output I need:
Id  Name   Marks  Rank
1   Mark   87     1
2   John   50     5
3   Adam   65     4
4   Cindy  68     3
5   Ruth   87     1
This is the code that I have at the moment:
import sqlite3
conn = sqlite3.connect('students.sqlite')
cur = conn.cursor()
# add the new Rank column to the table
cur.execute('ALTER TABLE student_marks ADD Rank INTEGER')
conn.commit()
If you are using a recent version of SQLite (3.25 or later, which added window functions), then you should probably avoid the update and just use the RANK() window function:
SELECT Id, Name, Marks, RANK() OVER (ORDER BY Marks DESC) "Rank"
FROM student_marks
ORDER BY Id;
RANK() gives tied rows the same rank and leaves a gap after them (1, 1, 3, ...), which satisfies both of your criteria; adding Id as a tiebreaker inside OVER (ORDER BY ...) would instead give the two students with 87 marks different ranks.
I found a few discussions, but couldn't find a proper solution within dplyr.
My main table consists of more than 50 columns, and I have 15 lookup tables, each with around 8-15 columns. I have multiple lookups to perform, and since it becomes really messy with select statements (either selecting columns or removing them with a minus), I would like to be able to replace column values on the fly.
Is this possible using dplyr? I have provided below just a sample data for better understanding.
I would like to do a VLOOKUP (as in Excel): match city in table against lcity in lookup, and replace the values of city with the corresponding newcity.
> table <- data.frame(name = c("a","b","c","d","e","f"), city = c("hyd","sbad","hyd","sbad","others","unknown"), rno = c(101,102,103,104,105,106),stringsAsFactors=FALSE)
> lookup <- data.frame(lcity = c("hyd","sbad","others","test"),newcity = c("nhyd","nsbad","nothers","ntest"),rating = c(10,20,40,55),newrating = c(100,200,400,550), stringsAsFactors = FALSE)
> table
name city rno
1 a hyd 101
2 b sbad 102
3 c hyd 103
4 d sbad 104
5 e others 105
6 f unknown 106
> lookup
lcity newcity rating newrating
1 hyd nhyd 10 100
2 sbad nsbad 20 200
3 others nothers 40 400
4 test ntest 55 550
My output table should be
name city rno
1 a nhyd 101
2 b nsbad 102
3 c nhyd 103
4 d nsbad 104
5 e nothers 105
6 f <NA> 106
I have tried the code below to update the values on the fly, but it creates another data frame/table instead of a character vector:
table$city <- select(left_join(table,lookup,by=c("city"="lcity")),"newcity")
One solution could be:
Note: the lookup data shown by the OP and the data created by the commands differ. I have used the lookup data shown in tabular format by the OP.
library(dplyr)
# Data from OP
table <- data.frame(name = c("a","b","c","d","e","f"),
city = c("hyd","sbad","hyd","sbad","others","unknown"),
rno = c(101,102,103,104,105,106),stringsAsFactors=FALSE)
lookup <- data.frame(lcity = c("hyd","sbad","others","test"),
newcity = c("nhyd","nsbad","nothers","ntest"),
rating = c(10,20,40,55),newrating = c(100,200,400,550),
stringsAsFactors = FALSE)
table %>%
inner_join(lookup, by = c("city" = "lcity")) %>%
mutate(city = newcity) %>%
select(name, city, rno)
name city rno
1 a nhyd 101
2 b nsbad 102
3 c nhyd 103
4 d nsbad 104
5 e nothers 105
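Note that inner_join drops the unmatched city ("unknown"), so row f is missing from the result above, unlike the expected output. A sketch with left_join instead, which keeps unmatched rows and turns city into NA as desired:
table %>%
  left_join(lookup, by = c("city" = "lcity")) %>%
  mutate(city = newcity) %>%
  select(name, city, rno)
  name    city rno
1    a    nhyd 101
2    b   nsbad 102
3    c    nhyd 103
4    d   nsbad 104
5    e nothers 105
6    f    <NA> 106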
I am trying to update a numeric value in a data frame when it is above a certain threshold, due to an input error. The value should be in the hundreds but occasionally is in the thousands, as it has an extra zero.
The data frame is called df and the column is called Value1.
Value1 (sample values)
650
6640
550
The value 6640 should be 664. I am trying to use the following:
df$Value1[df$Value1>1000] <- df$Value1/10
This is generating very odd results. I end up with no values greater than 1000, but a value of 6640 became 74.1 instead of 664 as I expected.
Any suggestions?
Thanks in advance
Here's how to do this in one line, without having to compute the target row indexes twice. (Your version misbehaves because the right-hand side df$Value1/10 is the whole column, not just the too-large entries, so its values are recycled into the shorter subset on the left.)
df$Value1[ris <- which(df$Value1>1000)] <- df$Value1[ris]/10;
df;
## Value1
## 1 650
## 2 664
## 3 550
Data
df <- data.frame(Value1=c(650L,6640L,550L));
Or we can use data.table (data from @bgoldst's post):
library(data.table)
setDT(df)[Value1 > 1000, Value1 := Value1/10]
df
# Value1
#1: 650
#2: 664
#3: 550
Here is one way:
#Sample data frame
d1
Value1
1 650
2 6640
3 550
# note: this truncates to the first three characters, so it assumes
# every corrected value has exactly three digits
d1$Value1 = as.numeric(substr(d1$Value1,1,3))
#result
d1
Value1
1 650
2 664
3 550
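For reference, a minimal base R sketch that avoids the three-digit assumption by dividing instead of truncating:
# vectorized over the whole column; no recycling issues
df$Value1 <- ifelse(df$Value1 > 1000, df$Value1 / 10, df$Value1)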
I'm relatively new to R and learning. I have the following data frame, data:
ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016
I am looking to count the number of people (in this case only two unique individuals) who passed their tests after multiple attempts (passing is defined as a grade of 65 or over). The final product would be a list of unique IDs that had multiple attempts before their test score hit 65. This would tell me that approx. 66% of the clients in this data frame require multiple test sessions before getting a passing grade.
Below is my idea, more or less; I've framed it as an if statement:
If ID appears twice
count how often it appears, until TEST GRADE >= 65
ifelse(duplicated(data$ID), count(ID), NA)
I'm struggling with the second piece where I want to say, count the occurrence of ID until grade >=65.
The other option I see is some sort of loop. Below is my attempt:
for (i in data$ID) {
  duplicated(data$ID)
  count(data$ID)
  # here is where something would say "until grade >= 65"
}
Again, the struggle is how to tell R to stop counting when the grade hits 65.
Appreciate the help!
You can use data.table:
library(data.table)
dt <- fread(" ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
# count the number of attempts per ID, then keep only the rows that passed
dt <- dt[, N:=.N, by=ID][grade>=65]
# proportion of successful testers who tried more than once
length(dt[N>1]$ID)/length(dt$ID)
[1] 0.6666667
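For reference, a base R sketch of the same computation, assuming the data frame is called data as in the question:
# attempts per ID, repeated on every row of that ID
data$N <- ave(data$ID, data$ID, FUN = length)
# among the passing rows, the share that needed more than one attempt
passed <- data[data$grade >= 65, ]
mean(passed$N > 1)
[1] 0.6666667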
Another option, though the other two work just fine:
library(dplyr)
dat2 <- dat %>%
group_by(ID) %>%
summarize(
multiattempts = n() > 1 & any(grade < 65),
maxgrade = max(grade)
)
dat2
# Source: local data frame [3 x 3]
# ID multiattempts maxgrade
# <int> <lgl> <int>
# 1 1 TRUE 73
# 2 2 TRUE 76
# 3 3 FALSE 66
sum(dat2$multiattempts) / nrow(dat2)
# [1] 0.6666667
Here is a method using the aggregate function and subsetting. It returns the maximum score among testers who took the test more than once, considering only their second and later attempts.
multiTestMax <- aggregate(grade~ID, data=df[duplicated(df$ID),], FUN=max)
multiTestMax
ID grade
1 1 73
2 2 76
To get the number of rows, you can use nrow:
nrow(multiTestMax)
2
or the proportion of all test takers:
nrow(multiTestMax) / length(unique(df$ID))
data
df <- read.table(header=T, text="ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
I am trying to work around the following inconvenience when exporting a table of factor levels. Here is the code to generate the sample data, and a table from it:
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))
library(Publish)
univariateTable(~data)
The default output of univariateTable is ordered by level (from A through D):
Variable Levels Value
1 data A 30 (7.5)
2 B 120 (30.0)
3 C 180 (45.0)
4 D 70 (17.5)
How can I change this so that the output is ordered by value instead? That is, the first row would have the largest number (and percentage) and the last row the lowest, like this:
Variable Levels Value
1 data C 180 (45.0)
2 B 120 (30.0)
3 D 70 (17.5)
4 A 30 (7.5)
Assuming that the "Publish" package is the one installed from GitHub, we extract the numbers before the "(" using sub, order them, and use that ordering to rearrange the "xlevels" and "summary.totals" components.
#library(devtools)
#install_github("TagTeam/Publish")
library(Publish)
Out <- univariateTable(~data)
i1 <- order(as.numeric(sub('\\s+.*', '',
Out$summary.totals$data)), decreasing=TRUE)
Out$xlevels$data <- Out$xlevels$data[i1]
Out$summary.totals$data <- Out$summary.totals$data[i1]
Out
# Variable Level Total
#1 data C 180 (45.0)
#2 B 120 (30.0)
#3 D 70 (17.5)
#4 A 30 (7.5)
data
set.seed(24)
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))
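A possible alternative, assuming univariateTable respects the order of the factor levels it is given (worth verifying with your version of Publish): reorder the levels by decreasing frequency before building the table.
data <- factor(data, levels = names(sort(table(data), decreasing = TRUE)))
univariateTable(~data)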