Filling in missing data and transposing

 Chong Ho (Alex) Yu, Ph.D.

Problem

The raw data are structured in the way that the same subject occupies multiple rows, as shown in the following table. This is called the tall structure, as opposed to the wide structure, in which each user takes a row and each item takes a column. In this example, there are five items and Question 5 is skipped by all examinees while the other items are skipped by only some of the examinees.

 User ID Item ID Score 123 Q1 4 123 Q2 2 123 Q3 3 123 Q4 4 124 Q1 3 124 Q2 4 124 Q4 4 125 Q2 2 125 Q3 1 125 Q4 4

The data should be transposed into the following. What should be done?

 User ID Q1 Q2 Q3 Q4 Q5 123 4 2 3 4 . 124 3 4 . 4 . 125 . 2 1 4 .

Solution

It is easy to manually convert the data when there are only five items and three subjects. However, this process will be extremely tedious when there are 50 items and 3,000 examinees. Before transposing, the missing data must be filled in. The following SAS program shows you how to fill in the missing and then transpose the data.
 /* Create a file to retain unqiue IDs only */ data idfile; set rawdata;     proc sort nodupkey; by userid; run; /* Create a file to make each subject carries 5 rows (items) and assign a faked ID to each row. The faked ID is just the ascending order (1-5) */ data idonly; set idfile;     array XX(5) X1-X5;         do fid=1 to 5;         x=xx[fid];         output;     end;     drop x1-x5 x itemid rawscore;     proc sort; by fid; run; /* Import the raw data and sort them for the merging procedure in a later stage */ data score; set rawdata;     proc sort; by userid itemid; run; /* Import a file listing the faked ID and all item IDs, including those that no one answered */ data item; set formitem;     proc sort; by fid; run; /* merge the ID file and item file so that every subject carries 5 items regardless of whether they have missing data */ data iditem; merge idonly item; by fid;     proc sort; by userid itemid; run; /* merge the raw data and the iditem file, and now even missing values are showed up. */ /* FID can be used to check whether every subject has 5 items */ data realdata; merge score iditem; by userid itemid; run;     proc sort; by userid itemid; run; /* transpose the data to make each row represents a subject with answered and non-answered items */ data transpose; set realdata; by userid itemid;     length i1-i5 \$8;     array scores s1-s5;     array inames \$ i1-i5;         retain s1-s5 i1-i5 n;         if first.userid then n = 1;         scores(n) = rawscore;         inames(n) = itemid;         n = n+1;         if last.userid then output; run;

Thanks Dr. Marlyin Thompson for her input in solving this problem.