I am analyzing a data set of feedback from teachers. Each row in the data frame is a teacher and each of their answers is a variable. However, I've run into a problem entering the year level for each teacher, as many of the teachers teach multiple grades.
eg:
Teacher Year
a 1
b 3
c 1/2
d 7
e 3/4
How can I enter this data into an Excel sheet and then into R and analyse it usefully? I've never before dealt with a variable that contains multiple options in the same row.
Suppose you already have this data in R in an object called teacher_data. The approach I have seen most commonly employed is to create additional columns so that each answer gets its own cell, via the convenient tidyr function separate().
library(tidyr)
separate(teacher_data, col = "Year", into = paste0("Year", 1:2), sep = "/")
Here's the result:
Teacher Year1 Year2
1 a 1 <NA>
2 b 3 <NA>
3 c 1 2
4 d 7 <NA>
5 e 3 4
How you then use those columns depends on what sort of question you're trying to answer with the data. That part of your question is probably best asked at the sister site Cross Validated (the Stack Exchange for statistics).
As far as Excel goes, I would not even deal with Excel as an intermediate step; it's just unnecessary. If you write the data out when you're done into a CSV, Excel can read CSVs just fine:
write.csv(teacher_data, file = "teacher_data.csv", row.names = FALSE)
Also, just so you know, I put your data into R via the following:
teacher_data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Teacher Year
a 1
b 3
c 1/2
d 7
e 3/4")
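As for analysing it afterwards, here is a sketch of one possibility (an assumption about what you might want, not the only approach): a long format, where each teacher/year combination is its own row, is often easier to work with than separate Year1/Year2 columns, and tidyr's separate_rows() gets you there directly:

```r
library(dplyr)
library(tidyr)

teacher_data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Teacher Year
a 1
b 3
c 1/2
d 7
e 3/4")

# One row per teacher/year combination
teacher_long <- teacher_data %>%
  separate_rows(Year, sep = "/") %>%
  mutate(Year = as.integer(Year))

# e.g. how many teachers teach each year level
count(teacher_long, Year)
```

From there, grouped summaries and plots work on Year like any other variable.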
Related
Often, the data from multiple response survey items are structured without sufficient information to make tidying very easy. Specifically, I have a survey question in which respondents pick one or more of 8 categorical items. The resulting dataframe has up to 8 strings separated by commas. Some cells might have two, four or none of the 8 options separated by commas. The eighth item is "Other" and may be populated with custom text.
Incidentally, this is a typical format for Google Forms multiple-response data.
Below are example data. The third and last rows include a unique response for the eighth "other" option:
structure(list(actvTypes = c(NA, NA, "Data collection, Results / findings / learnings, ate ants and milkweed",
NA, "Discussion of our research question, Planning for data collection",
"Data analysis, Collected data, apples are yummy")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I'd like to make a set of 8 new columns into which the responses are recorded as either 0 or 1. How can this be done efficiently?
I have a solution but it is cumbersome. I started by creating new columns for each of the response options:
atypes <- paste0("atype", 1:8)
log[atypes] <- NA
Next, I wrote eight ifelse statements; the format for the first seven is shown below:
log$atype7 <- ifelse(str_detect(log$actvTypes, fixed("Met with non-DASA team member (not data collection)")), 1, 0)
For the "other" response option, I used a list of strings and a sapply solution:
alloptions <- c("Discussion of our research question", "Planning for data collection",
                "Data analysis", "Discussion of results | findings | learnings",
                "Mid-course corrections to our project", "Collected data",
                "Met with non-DASA team member (not data collection)")
log$atype8 <- sapply(log$actvTypes, function(x)
  ifelse(any(sapply(alloptions, str_detect, string = x)), 1, 0))
How might this coding scheme be more elegant? Perhaps sapply and using an index?
Depending on what you're ultimately trying to do, the following could be helpful:
library(tidyverse)
df %>%
rownames_to_column(var = "responder") %>%
separate_rows(actvTypes, sep = ",") %>%
mutate(actvTypes = fct_explicit_na(actvTypes)) %>%
count(actvTypes)
# # A tibble: 9 x 2
# actvTypes n
# <fct> <int>
# 1 " apples are yummy" 1
# 2 " ate ants and milkweed" 1
# 3 " Collected data" 1
# 4 " Planning for data collection" 1
# 5 " Results / findings / learnings" 1
# 6 Data analysis 1
# 7 Data collection 1
# 8 Discussion of our research question 1
# 9 (Missing) 3
Take note of what the data looks like right before the call to count(): grouping the "other" responses together should be trivial if you know the "non-other" categories beforehand. You may also want to look at what the data looks like right after the call to separate_rows().
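To get all the way to the 0/1 indicator columns the question asks for, the same separate_rows() idea can be pushed through pivot_wider(). This is a sketch under the assumption that you know the non-other category labels up front (substitute the real list of seven); anything unmatched is collapsed into an "other" column, and respondents whose cell was NA drop out:

```r
library(tidyverse)

df <- structure(list(actvTypes = c(NA, NA,
  "Data collection, Results / findings / learnings, ate ants and milkweed",
  NA, "Discussion of our research question, Planning for data collection",
  "Data analysis, Collected data, apples are yummy")),
  row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

known <- c("Data collection", "Results / findings / learnings",
           "Discussion of our research question", "Planning for data collection",
           "Data analysis", "Collected data")

wide <- df %>%
  rownames_to_column("responder") %>%
  separate_rows(actvTypes, sep = ",\\s*") %>%    # one row per selected option
  filter(!is.na(actvTypes)) %>%
  mutate(actvTypes = if_else(actvTypes %in% known, actvTypes, "other"),
         flag = 1L) %>%
  distinct() %>%                                 # guard against repeated options
  pivot_wider(names_from = actvTypes, values_from = flag, values_fill = 0L)
```

Using sep = ",\\s*" also eats the stray leading spaces that show up in the count() output above.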
This question already has answers here:
Simultaneously merge multiple data.frames in a list
(9 answers)
Closed 5 years ago.
This one is a doozy. I've been trying to figure it out for a while, but I keep hitting a wall. So, I'm crowdsourcing this in the name of science.
A Brief Introduction
I have about 93 files with unique names in a directory. I read these files into a list using R.
files.measurements <- as.character(list.files(path = "~/measurements/", full.names = TRUE))
So, what this does is just find the names of all files in the directory. All these files are .csv, which saves me a lot of hassle.
I then read the names of the files.
measurements.filenames <- gsub(".csv", "", basename(files.measurements))
The reason to extract these names is that each file name represents the name of the measurement. The same item may or may not exist in multiple files.
For Example
There are 5 file names, viz., NameA, NameB, NameC, NameD, NameE. Each file has the same 8 columns: id, name, sex, dob, ..., measurement. The id is unique, but an id that exists in NameA may or may not also exist in NameB.
Need
So, what I need to do is merge these 93 files into a single data frame containing id, name, sex, dob, ..., and, instead of a single measurement column, one column per file name (NameA, for example). The value should be the same for the same id. If an id doesn't exist yet, add it as a new row with the additional column filled in; if it does exist, just add that file's measurement under the new column name (NameB, say).
Can you please help? This is to gather the data for cardiovascular and HIV diseases for research.
EDIT
DATA
NameA
id gender dob status date measurement
1 F 5/24/1942 Rpt 1/12/2018 2.9
2 F 12/1/2017 Rpt 1/12/2018 0.622
3 M 11/15/1957 Rpt 1/11/2018 3.6
4 M 5/17/1947 Rpt 1/11/2018 3.5
5 F 7/17/1955 Rpt 1/11/2018 2.7
NameB
id gender dob status date measurement
1 F 5/24/1942 Rpt 1/12/2018 3.5
2 F 12/1/2017 Rpt 1/12/2018 2.5
8 M 11/15/1957 Rpt 1/11/2018 1.9
10 M 5/17/1947 Rpt 1/11/2018 0.8
11 F 7/17/1955 Rpt 1/11/2018 1.2
Explanation
So, as you see, all the columns in both tables are the same, but the last measurement differs. Please ignore the gender, dob, status and date columns for now; let's focus on id and measurement. As you can see, ids 1 and 2 are in both NameA and NameB. In that case, the measurement from NameB should be added to the data frame right next to the measurement from NameA, under a name like NameB-measurement. And all the ids that exist in NameB but not in NameA should be added as new rows, with the NameA measurement blank but the NameB-measurement filled in.
I know it's convoluted, but that's how the researchers gave me the data. I need to clean this up somehow.
Try the following:
# collecting all the csv files in a given folder
files.measurements <- list.files(path = ".", pattern = "\\.csv$", full.names = TRUE)
# reading all csv files into a list of dataframes
files.combined <- purrr::map(files.measurements, read.csv)
# combining the individual dataframes into a single dataframe
finaldf <- plyr::rbind.fill(files.combined)
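Note that rbind.fill() stacks the files on top of each other, while the question asks for one measurement column per file. Here is a sketch of that wide layout, using two tiny stand-in files in place of the 93 real CSVs and assuming the shared columns agree for a given id: rename each file's measurement column to the file's name, then full-join everything on the shared keys.

```r
library(dplyr)
library(purrr)

# Stand-in data: two small files in place of the 93 real ones
dir <- file.path(tempdir(), "measurements")
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, gender = c("F", "F"), measurement = c(2.9, 0.622)),
          file.path(dir, "NameA.csv"), row.names = FALSE)
write.csv(data.frame(id = c(1, 8), gender = c("F", "M"), measurement = c(3.5, 1.9)),
          file.path(dir, "NameB.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

read_one <- function(path) {
  nm <- sub("\\.csv$", "", basename(path))   # the file name becomes the column name
  df <- read.csv(path, stringsAsFactors = FALSE)
  names(df)[names(df) == "measurement"] <- nm
  df
}

# Ids missing from a file get NA in that file's column, matching the "blank" requirement
wide <- files %>% map(read_one) %>% reduce(full_join, by = c("id", "gender"))
```

If the demographic columns can disagree between files for the same id, join by id alone and reconcile the duplicates separately.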
After searching for how to summarize a data frame, I did it. I can see the results in my console, shown below after the first two lines of code:
byTue <- group_by(luckyloss.3,L_byUXR)
( sumMon <- summarize(byTue,count=n()) )
Below is what I see on the console. It feels good, because it shows I got what I was looking for. The results come from a column of 234 rows with many repeated values, so this summarise of the 234 rows shows ANA appearing 8 times, ARI 14, and so on:
# A tibble: 30 × 2
L_byUXR count
<chr> <int>
1 ANA 8
2 ARI 14
3 ATL 16
4 BAL 4
5 BOS 6
6 CHA 12
7 CHN 8
8 CIN 10
9 CLE 4
10 COL 8
# ... with 20 more rows
What I want is to get this output of 30 rows by two columns into a Word document, or even HTML.
I tried write.csv(byTue), but what I received was the list of 234 rows of the original data frame; it's as if the summarise disappeared. I have checked other ways (markdown, creating new files) and tried to see if the knitr package could help, but nothing worked.
library(stringi) # ONLY NECESSARY FOR DATA SIMULATION
library(officer) # <<= install this
library(tidyverse)
Simulate some data:
set.seed(2017-11-18)
data_frame(
L_byUXR = stri_rand_strings(30, 3, pattern="[A-Z]"),
count = sample(20, 30, replace=TRUE)
) -> sumMon
Start a new Word doc and add the table, saving to a new doc:
read_docx() %>% # a new, empty document
body_add_table(sumMon, style = "table_template") %>%
print(target="new.docx")
I kept looking for an answer and found the stargazer package for R, which let me get the data frame as text that can be further edited.
When you write the R instruction, name the file you want in the out = argument, and stargazer will place it there in your session's working folder.
The instruction I used was:
stargazer(sumMon, type = "text", summary = FALSE, title = "Any Title", digits = 1, out = "table1.txt")
Even though I found the answer, I could not have done it without the help of hrbrmstr, who showed me there was a package to do it; I just needed to work on it more.
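For completeness, a simpler root-cause note (using a stand-in for the summarized tibble): writing the summarized object itself, rather than the original 234-row data frame, keeps the 30-row summary, and knitr::kable() renders it as a text table that pastes cleanly into Word or knits to HTML via R Markdown.

```r
library(knitr)

# Stand-in for the result of summarize(); the real object is sumMon from above
sumMon <- data.frame(L_byUXR = c("ANA", "ARI", "ATL"), count = c(8, 14, 16))

# Export the summary itself, not the 234-row source frame
out <- file.path(tempdir(), "sumMon.csv")
write.csv(sumMon, out, row.names = FALSE)

# A markdown-style table, ready for Word/HTML when knitted
kable(sumMon)
```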
I have been working on a dummy dataset recently, and I found that the data provided to me is all on a single line. A similar example is shown below:
Name,Age,Gender,Occupation A,10,M,Student B,11,M,Student C,11,F,Student
I want to import the data and obtain an output as follows:
Name Age Gender Occupation
A 10 M Student
B 11 M Student
C 12 F Student
A case may arise where a value is missing, so the import logic needs to handle that. Can anyone help me build the logic for importing such data sets?
I tried the normal import, but it really didn't help: I just imported the file with the read.csv() function, and it didn't give me the expected result.
EDIT: What if the data is like this:
Name,Age,Gender,Occupation ABC XYZ,10,M,Student B,11,M,Student C,11,F,Student
and I want an output like:
Name Age Gender Occupation
ABC XYZ 10 M Student
B 11 M Student
C 12 F Student
You could read your file in with readLines, turn spaces into line breaks, and then read it with read.csv:
# txt <- readLines("my_data.txt") # with a real data file
txt <- readLines(textConnection("Name,Age,Gender,Occupation A,10,M,Student B,11,M,Student C,11,F,Student"))
read.csv(text=gsub(" ","\n",txt))
output
Name Age Gender Occupation
1 A 10 M Student
2 B 11 M Student
3 C 11 F Student
If you have millions of records, you will probably want to speed up this process, so I suggest using data.table's fread instead of read.csv. fread can also take a shell command to pre-process the file before reading it into R, and sed will be a lot faster than doing the string manipulation in R.
E.g., if you have this CSV stored at /tmp/x.csv, you can try something like:
> data.table::fread("sed 's/ /\\n/g' /tmp/x.csv")
Name Age Gender Occupation
1: A 10 M Student
2: B 11 M Student
3: C 11 F Student
I have two data sets. One has 2 million cases (individual donations to various causes), the other has about 38,000 (all zip codes in the U.S.).
I want to sort through the first data set and tally up the total number of donations by zip code. (Additionally, the total for each zip code will be broken down by cause.) Each case in the first data set includes the zip code of the corresponding donation and information about what kind of cause it went to.
Is there an efficient way to do this? The only approach that I (very much a novice) can think of is to use a for ... if loop to go through each case and count them up one by one. This seems like it would be really slow, though, for data sets of this size.
Edit: thanks, @josilber. This gets me a step closer to what I'm looking for.
One more question, though. table seems to generate frequencies, correct? What if I'm actually looking for the sum for each cause by zip code? For example, if the data frame looks like this:
dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE),
amt = sample(250:2500, 2000000, replace=TRUE))
Suppose that, instead of frequencies, I want to end up with output that looks like this:
# Cause 1(amt) Cause 2(amt) Cause 3(amt)
# Zip 1 (sum) (sum) (sum)
# Zip 2 (sum) (sum) (sum)
# Zip 3 (sum) (sum) (sum)
# etc. ... ... ...
Does that make sense?
Sure, you can accomplish what you're looking for with the table command in R. First, let's start with a reproducible example (I'll create an example with 2 million cases, 3 zip codes, and 3 causes; I know you have more zip codes and more causes but that won't cause the code to take too much longer to run):
# Data
set.seed(144)
dat <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE))
Please note that it's a good idea to include a reproducible example with all your questions on Stack Overflow because it helps make sure we understand what you're asking! Basically you should include a sample dataset (like the one I've just included) along with your desired output for that dataset.
Now you can use the table function to count the number of donations in each zip code, broken down by cause:
table(dat$zip, dat$cause)
# Cause 1 Cause 2 Cause 3
# Zip 1 222276 222004 222744
# Zip 2 222068 222791 222363
# Zip 3 221015 221930 222809
This took about 0.3 seconds on my computer.
Could this work?
aggregate(amt ~ cause + zip, data = dat3, FUN = sum)
cause zip amt
1 Cause 1 Zip 1 306231179
2 Cause 2 Zip 1 306600943
3 Cause 3 Zip 1 305964165
4 Cause 1 Zip 2 305788668
5 Cause 2 Zip 2 306306940
6 Cause 3 Zip 2 305559305
7 Cause 1 Zip 3 304898918
8 Cause 2 Zip 3 304281568
9 Cause 3 Zip 3 303939326
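The zip-by-cause grid of sums sketched in the question's edit can also come straight from base R's xtabs(), which sums a numeric left-hand side over the cross-classification on the right:

```r
# Reproducible example data (seeded here so the block stands alone)
set.seed(144)
dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace = TRUE),
                   cause = sample(paste("Cause", 1:3), 2000000, replace = TRUE),
                   amt = sample(250:2500, 2000000, replace = TRUE))

# A zip x cause matrix: sum of amt for every combination
xtabs(amt ~ zip + cause, data = dat3)
```

This gives the same totals as the aggregate() call above, just already arranged in the wide layout the question asked for.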