I am currently analysing a rather large dataset (22k+ records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).
The data came in two CSV files, one giving demographics and the other giving participants' probability ratings for a number of questions. Both of these CSV files were in long format.
I have used the reshape (and reshape2 for speed) packages to attempt to solve my problem. The specific issue I am having is the following.
I have the participants' probability ratings in the following form (after one successful reshape):
dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")
Now, the format I would like my data to be in is as follows:
User ID, Qid1, ..., Qid255, Time, with the probability for each question in that question's corresponding column.
I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.
In the past, I've always edited my CSV files directly, but this is not an option with the size of this file (my laziness when it comes to data manipulation within R has come back to haunt me).
Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.
Your dataset has 6 rows, 3 of which have the column "variable" equal to "probability" and 3 of which have that column equal to "time". You want to have probability be the value of each, and time be added onto the right.
I think there's a difficulty in making this work for you because what you want to do isn't clear. You have values for each UID-Time-X### cell, and values for each UID-Prob-X### cell. Therefore, you have to discard information to get it into your preferred format (UID-Time-X### with probabilities as the values). It seems to me like you're treating time as an ID variable, but it's storing values like a content variable.
To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3
Which is simply reshaped wide.
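As a rough illustration of that wide layout, here is a minimal sketch using base R's reshape(). The toy data frame and its column names (uid, variable, X1, X2) are made up for the example and are only assumed to resemble the structure described above; they are not taken from foobar.csv.

```r
# Hypothetical long data: two rows per user, one holding probabilities
# and one holding times, with one column per question.
long <- data.frame(
  uid      = c(1, 1, 2, 2),
  variable = c("probability", "time", "probability", "time"),
  X1       = c(0.2, 35, 0.9, 12),
  X2       = c(0.5, 40, 0.1, 18)
)

# Spread the "variable" rows into separate columns, keeping one row per uid.
wide <- reshape(long,
                idvar     = "uid",
                timevar   = "variable",
                direction = "wide",
                sep       = "_")

print(wide)
```

Each user ends up on a single row, with X1_probability, X2_probability, X1_time and X2_time as separate columns, so nothing has to be discarded.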
It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst trying to write the code I have restricted myself to just 3 years; the illustration of the structure is based on this test). The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate. I've labelled them policy 1, 2, etc. for sensitivity reasons and limited the number whilst testing the code, to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column and the grouped data became the row, and then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested that loops might be a better way forward. I am not sure of the best approach, so I welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer, but this appears to be a summarise tool, and I am not trying to summarise the data but rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
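For what it's worth, the restructuring described above (one row per policy and year, one column per metric) can be sketched in base R without loops and without summarising anything. This is only a sketch under assumptions: the list `sheets`, the metric names "jobs" and "cost", and all of the values are invented stand-ins for the sensitive data, and the real target layout may differ.

```r
# Hypothetical stand-in for the imported workbook: one data frame per policy,
# one row per metric, one column per year.
sheets <- list(
  policy1 = data.frame(metric = c("jobs", "cost"),
                       `2024` = c(10, 5), `2030` = c(12, 6), `2035` = c(15, 7),
                       check.names = FALSE),
  policy2 = data.frame(metric = c("jobs", "cost"),
                       `2024` = c(20, 8), `2030` = c(22, 9), `2035` = c(25, 11),
                       check.names = FALSE)
)

# 1. Stack the sheets, recording which policy each row came from.
stacked <- do.call(rbind, Map(function(df, nm) cbind(policy = nm, df),
                              sheets, names(sheets)))

# 2. Long form: one row per policy, metric and year.
long <- reshape(stacked, direction = "long",
                varying = c("2024", "2030", "2035"), v.names = "value",
                timevar = "year", times = c(2024, 2030, 2035),
                idvar = c("policy", "metric"))

# 3. Wide form: one row per policy and year, one column per metric.
wide <- reshape(long, direction = "wide",
                idvar = c("policy", "year"), timevar = "metric",
                v.names = "value", sep = "_")

print(wide)
```

No values are aggregated at any step; every number in the original sheets appears exactly once in `wide`, in the columns value_jobs and value_cost.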
I am currently working with a large amount of data. For testing purposes I am using a smaller batch, but the main point of concern is the sorting of all the data based off of values in one particular column. I have posted a picture below that shows a small portion of my unsorted data. I want to sort the values in row 2 in ascending order along with all other data in those corresponding columns. In other words I don't want to just order row 2, I want to order row 2 and shift all other data with those re-ordered values.
Currently what I do is read in that csv to a data frame (tmpDF).
After that I transpose the data using tmpDF <- t(tmpDF)
Now I take that data and order the second column into ascending order (or at least that is what I think I am doing): tmpDF <- tmpDF[order(tmpDF[,1]),]
Then I re-transpose the data to get it back how it was originally, but sorted. The result is shown in the picture below, "Ordered data result". Keep in mind that the unsorted and sorted pictures show different numbers because I have not posted my entire data set.
I have a few questions about this.
1) Am I going about this the correct way? I am not a very experienced programmer, just trying to teach myself R to help out my research efforts.
2) Why are the values such as "102" being represented as "1.01E+02" in my final sorted csv file? I don't believe I am changing type and in the original file they were represented as "102"
3) Why does the value 116 get ordered before "1.01E+02"?
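One possible source of the odd ordering is the transpose itself: t() turns a data frame into a matrix, and if any column is not numeric the whole matrix becomes character, so order() then sorts text rather than numbers. Below is a minimal sketch of reordering the columns by the values in row 2 without transposing at all. The data frames and their column names are made up for illustration and are not the real data.

```r
# Hypothetical data: the columns should be reordered by the values in row 2.
tmpDF <- data.frame(a = c(1, 116, 5),
                    b = c(2, 102, 6),
                    c = c(3, 9, 7))

# Pull row 2 out as a plain numeric vector and use it to order the columns.
key    <- as.numeric(unlist(tmpDF[2, ]))
sorted <- tmpDF[, order(key), drop = FALSE]
print(sorted)

# For comparison: transposing a data frame that contains a text column
# produces a character matrix, so its numbers are then ordered as text.
mixed      <- data.frame(label = c("x", "y"), value = c(102, 116))
transposed <- t(mixed)
print(is.character(transposed))
```

Because whole columns are moved, all other rows stay aligned with the reordered row 2.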
Turns out I shouldn't have trusted the source of my data. They left duplicate observations and didn't clean the data as well as I assumed. So this question is moot.
I am attempting to merge two data frames. I've done this many times in the past with great success (after weeding out typos). I've been beating my head against the wall with this one. I cannot find the issue. One file has only 6 columns, 4 of which are repeated in the larger file. I need to merge by unique combinations of these 4 columns. For instance, Plant 1 at Transect A at Site X in year 2014 should have only 1 row. Each Transect and Site have unique prefixes assigned to each plant, but I need to subset out by these 4 columns later, so I want to maintain them.
I've tried both cbind() and merge(). In merge() I've also tried all=TRUE and all=FALSE, since I know some of the rows are basically populated by NAs only and don't add anything to my analyses.
dat=cbind(dens, df)
dat=merge(dens, df, by=c("Year", "site", "transect", "PlantID"))
or
dat=merge(dens, df, by=c("PlantID","Year", "site", "transect"), all=F)
These data files are both only just over 7000 observations in length. But when I cbind or merge, I get the same df, which is well over 10,000 observations. I've looked at the output and a good number of the individuals have been quadrupled. I'm sure it's something very simple that I've missed but at this point I need fresh and knowledgeable eyes.
Here is a link to the two data files on Google Drive.
https://drive.google.com/drive/folders/1JQXSadqxQBOXM5AAOFAr-BmuoX9TXKXh?usp=sharing
A couple of things. When you merge, you usually use only one primary key, as merging on several can be prone to issues. From your description it sounds like the keys you are using are not the same in both files. For instance, one dataset may have a column Col1 and the other col1, or worse, the columns may be different data types that appear the same on screen. Try taking a small subset of your datasets and merging those first, before throwing the whole process at it and being surprised it doesn't work.
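A quick way to run that kind of check is to compare the key column types and look for repeated key combinations before merging, since each repeated key multiplies rows in the result. The sketch below uses two small made-up data frames; only the names dens, df and the four key columns come from the question, and the density and height columns are invented.

```r
keys <- c("PlantID", "Year", "site", "transect")

# Hypothetical stand-ins for the two files; plant A1 is listed twice in each.
dens <- data.frame(PlantID = c("A1", "A1", "A2"), Year = 2014,
                   site = "X", transect = "A", density = c(3, 3, 5))
df   <- data.frame(PlantID = c("A1", "A1", "A2"), Year = 2014,
                   site = "X", transect = "A", height = c(10, 11, 20))

# Do the key columns have the same class in both data frames?
print(sapply(dens[keys], class))
print(sapply(df[keys], class))

# Which rows share their key combination with another row?
dup_dens <- dens[duplicated(dens[, keys]) |
                 duplicated(dens[, keys], fromLast = TRUE), ]
print(dup_dens)

# Every duplicated key in dens pairs with every matching row in df,
# so the merged result has more rows than either input.
merged <- merge(dens, df, by = keys)
print(nrow(merged))
```

Here three rows in each input become five rows after the merge, because the two A1 rows in dens each match both A1 rows in df.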
I need to subset my data depending on the content of one factor variable.
I tried to do it with subset:
new <- subset(data, original$Group1=="SALAD")
data is already a subset of a bigger data frame, original, which holds the factor variable that should identify the wanted rows.
This works perfectly for one level of the factor variable, but (and I really don't understand why!) when I do it with the other factor level "BREAD" it creates the data frame but says "no data available", so it is empty. I've imported the data from SPSS, if this matters. I've already checked the factor levels, and the naming should be right!
I would be really grateful for help; I spent 3 hours on this problem and wasn't able to find a solution.
I've also tried other ways to subset my data (e.g. split), but I want a data frame as output.
Do you have general advice on the best way to subset a data frame if I want, e.g., 3 columns of this data frame, extracted depending on the level of a factor? (Most code examples are only for one or all columns.)
The entire point of the subset function (as I understand it) is to look inside the data frame for the right variable - so you can type
subset(data, var1 == "value")
instead of
data[data$var1 == "value", ]
Please correct me anyone if that is incorrect.
Now, in your case, you are explicitly taking Group1 from the data frame original and using that to subset data, which you say is a subset of original. Based on this, I see no reason to believe (and every reason not to believe) that the elements of original$Group1 will align with the rows of data. If Group1 is defined within data, why not just use the copy defined there, which is aligned correctly? If not, you need to be very explicit about what you are trying to accomplish, so that you can ensure that things are aligned correctly.
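To make the alignment point concrete, here is a small sketch with invented data. Only the names original, data and Group1 and the levels "SALAD" and "BREAD" come from the question; the id, score and weight columns are assumptions. It also shows the select argument of subset(), which keeps only chosen columns.

```r
# Hypothetical "original" data frame with the grouping factor.
original <- data.frame(
  id     = 1:6,
  Group1 = factor(c("SALAD", "BREAD", "SALAD", "BREAD", "SALAD", "BREAD")),
  score  = c(10, 20, 30, 40, 50, 60),
  weight = c(1.1, 2.2, 3.3, 4.4, 5.5, 6.6)
)

# "data" is a subset, so its rows no longer line up with original$Group1.
data <- original[original$id > 2, ]

# Use the Group1 column that lives inside data, and keep only some columns.
bread <- subset(data, Group1 == "BREAD", select = c(id, score))
print(bread)
```

Because the condition is evaluated inside data, the rows with ids 4 and 6 are returned regardless of how data was cut out of original.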
I am trying to change my data from long to wide format. It is a factorial design with one between subject and two within subject variables.
My data:
https://drive.google.com/file/d/0B9lnMw6dkH9KZUZKQkh4M3BIbGM/view?usp=sharing
When I try
library(reshape2)
data.wide<- dcast(correct.anal,group+subnum~speed+int, value.var="corr")
on the data, it says
Aggregation function missing: defaulting to length
I do not have duplicate values though so I do not understand what I need to do.
What I want to achieve is to get from my current data to output one line per subject with 22 columns (subnum, group and the twenty combinations).
Can anyone help with that?
Perhaps this can help:
data.wide<- dcast(correct.anal,group+subnum~speed+int,fun.aggregate=mean, value.var="corr")
I just added fun.aggregate=mean to average the duplicates.
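As far as I know, dcast prints that message when at least one group/subnum/speed/int combination has more than one row, so it may be worth checking which cells are repeated before averaging them. A minimal sketch in base R follows; the toy correct.anal below is invented and only assumed to share the column names used in the question.

```r
# Hypothetical data in the same shape as correct.anal, with one repeated cell.
correct.anal <- data.frame(
  group  = c("a", "a", "a"),
  subnum = c(1, 1, 1),
  speed  = c("fast", "fast", "slow"),
  int    = c(1, 1, 1),
  corr   = c(0.8, 0.9, 0.7)
)

# Count the rows in each group/subnum/speed/int cell.
cell_counts <- aggregate(corr ~ group + subnum + speed + int,
                         data = correct.anal, FUN = length)

# Cells with a count above 1 are the ones dcast would have to aggregate.
dups <- cell_counts[cell_counts$corr > 1, ]
print(dups)
```

If dups comes back empty on the real data, every cell is unique and no aggregation function should be needed.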